93 research outputs found

    Stochastic Planning with Lifted Symbolic Trajectory Optimization

    This paper investigates online stochastic planning for problems with large factored state and action spaces. One promising approach in recent work estimates the quality of applicable actions in the current state through aggregate simulation from the states they reach. This leads to a significant speedup compared to search over concrete states and actions, and suffices to guide decision making in cases where the performance of a random policy is informative of the quality of a state. The paper makes two significant improvements to this approach. The first, taking inspiration from lifted belief propagation, exploits the structure of the problem to derive a more compact computation graph for aggregate simulation. The second replaces the random policy embedded in the computation graph with symbolic variables that are optimized simultaneously with the search for high-quality actions. This expands the scope of the approach to problems that require deep search and where information is lost quickly with random steps. An empirical evaluation shows that these ideas significantly improve performance, leading to state-of-the-art results on hard planning problems.
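
    A minimal sketch of the joint-optimization idea described above, assuming a differentiable aggregate simulation over state-variable marginals; `transition_marginals` and `expected_reward` are hypothetical stand-ins, not the paper's actual computation graph.

```python
# Hedged sketch: jointly optimize the first action and the rollout policy by
# gradient ascent on a differentiable aggregate simulation. The factored
# transition model `transition_marginals` and reward `expected_reward` are
# hypothetical placeholders supplied by the caller.
import torch

def aggregate_value(action_logits, policy_params, transition_marginals,
                    expected_reward, state_marginals, horizon):
    """Approximate value of a soft first action followed by a parametric
    rollout policy, propagating state-variable marginals (not concrete states)."""
    action_probs = torch.softmax(action_logits, dim=-1)
    total = expected_reward(state_marginals, action_probs)
    for _ in range(horizon - 1):
        state_marginals = transition_marginals(state_marginals, action_probs)
        # Rollout actions come from optimizable symbolic parameters rather
        # than a fixed random policy.
        action_probs = torch.softmax(policy_params @ state_marginals, dim=-1)
        total = total + expected_reward(state_marginals, action_probs)
    return total

def plan(transition_marginals, expected_reward, state_marginals,
         n_actions, n_state_vars, horizon=10, steps=200):
    action_logits = torch.zeros(n_actions, requires_grad=True)
    policy_params = torch.zeros(n_actions, n_state_vars, requires_grad=True)
    opt = torch.optim.Adam([action_logits, policy_params], lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = -aggregate_value(action_logits, policy_params, transition_marginals,
                                expected_reward, state_marginals, horizon)
        loss.backward()
        opt.step()
    return int(torch.argmax(action_logits))  # best first action under the learned rollout
```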

    Multi-Period Stochastic Resource Planning: Models, Algorithms and Applications

    This research addresses the problem of sequential decision making under uncertainty in the professional service industry. Specifically, it considers the problem of dynamically assigning resources to tasks in a stochastic environment with both uncertainty of resource availability due to attrition and uncertainty of job availability due to unknown project bid outcomes. The problem is motivated by the resource planning application at Hewlett Packard Enterprise (HPE). The challenge is to provide resource planning support over a time horizon under the influence of internal resource attrition and demand uncertainty. To ensure demand is satisfied, external contingent resources can be engaged to make up for internal resource attrition. The objective is to maximize profitability by identifying the optimal mix of internal and contingent resources and their assignments to project tasks under explicit uncertainty. While sequential decision problems under uncertainty can often be modeled as a Markov decision process (MDP), the classical dynamic programming (DP) method based on the Bellman equation suffers from the well-known curse of dimensionality and only works for small instances. To tackle this challenge, the research focuses on developing computationally tractable closed-loop Approximate Dynamic Programming (ADP) algorithms that obtain near-optimal solutions in reasonable computational time. Various approximation schemes are developed to approximate the cost-to-go function. A comprehensive computational experiment is conducted to investigate the performance and behavior of the ADP algorithm, and the performance of ADP is compared with that of a rolling horizon approach as a benchmark. Computational results show that the optimization model and algorithm developed in this thesis offer solutions with higher profitability and utilization of internal resources for companies in the professional service industry.
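
    A hedged sketch of a generic closed-loop ADP loop with a linear cost-to-go approximation; the feature map, simulator, and decision generator are hypothetical placeholders, not the thesis's actual resource-planning model.

```python
# Hedged sketch of closed-loop ADP with a linear approximation of the
# cost-to-go (value) function. `simulate_step`, `candidate_decisions`, and
# `features` are hypothetical caller-supplied components.
import numpy as np

def adp(simulate_step, candidate_decisions, features, initial_state,
        horizon, n_iterations=100, alpha=0.05, gamma=1.0):
    """Learn weights w so that features(state) @ w approximates the
    downstream profit-to-go, then act greedily against that approximation."""
    w = np.zeros(features(initial_state).shape[0])
    for _ in range(n_iterations):
        state = initial_state
        for t in range(horizon):
            # Greedy decision against the current value approximation.
            best_decision, best_value, best_next = None, -np.inf, state
            for decision in candidate_decisions(state, t):
                next_state, profit = simulate_step(state, decision, t)
                value = profit + gamma * features(next_state) @ w
                if value > best_value:
                    best_decision, best_value, best_next = decision, value, next_state
            # Temporal-difference style update toward the observed value.
            w += alpha * (best_value - features(state) @ w) * features(state)
            state = best_next
    return w
```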

    How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization

    Designing and deriving effective model-based reinforcement learning (MBRL) algorithms with a performance improvement guarantee is challenging, mainly because of the tight coupling between model learning and policy optimization. Many prior methods that rely on return discrepancy to guide model learning ignore the impact of model shift, which can lead to performance deterioration due to excessive model updates. Other methods use a performance difference bound to explicitly consider model shift; however, they rely on a fixed threshold to constrain model shift, resulting in a heavy dependence on the threshold and a lack of adaptability during training. In this paper, we theoretically derive an optimization objective that unifies model shift and model bias and then formulate a fine-tuning process. This process adaptively adjusts the model updates to obtain a performance improvement guarantee while avoiding model overfitting. Based on this, we develop a straightforward algorithm, USB-PO (Unified model Shift and model Bias Policy Optimization). Empirical results show that USB-PO achieves state-of-the-art performance on several challenging benchmark tasks.
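
    A hedged sketch of the general idea of adaptive model fine-tuning, not the paper's actual USB-PO objective: keep updating the dynamics model only while a combined proxy of model bias (validation error) and model shift (divergence from the previous model's predictions) keeps improving, rather than stopping at a fixed threshold. All names and the proxy itself are assumptions for illustration.

```python
# Hedged sketch (not the paper's derived objective): adaptive fine-tuning of a
# dynamics model driven by a unified bias + shift proxy.
import copy
import torch

def fine_tune_model(model, loss_fn, train_batch, val_batch,
                    shift_weight=1.0, max_steps=50, lr=1e-3):
    prev_model = copy.deepcopy(model)          # frozen snapshot of the pre-update model
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_score = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(max_steps):
        opt.zero_grad()
        loss_fn(model, train_batch).backward()
        opt.step()
        with torch.no_grad():
            bias = loss_fn(model, val_batch)                          # model bias proxy
            shift = ((model(val_batch["obs"]) -
                      prev_model(val_batch["obs"])) ** 2).mean()      # model shift proxy
            score = (bias + shift_weight * shift).item()
        if score < best_score:
            best_score, best_state = score, copy.deepcopy(model.state_dict())
        else:
            break        # stop adapting once the unified proxy stops improving
    model.load_state_dict(best_state)
    return model
```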

    Language representations for generalization in reinforcement learning

    The choice of state and action representation in Reinforcement Learning (RL) has a significant effect on agent performance for the training task, but its relationship with generalization to new tasks is under-explored. One approach to improving generalization investigated here is the use of language as a representation. We compare vector states and discrete actions to language representations. We find that agents using language representations generalize better and can solve tasks with more entities, new entities, and more complexity than seen in the training task. We attribute this to the compositionality of language.
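
    A hedged illustration of the two representations being compared, using a hypothetical grid-world observation rather than the paper's environments: the fixed-size vector encoding caps the number of entities, while the language encoding composes one clause per entity and so extends naturally to more or new entities.

```python
# Hypothetical grid-world observation rendered two ways: as a fixed-size
# vector and as a language description.
from typing import List, Tuple

def vector_state(entities: List[Tuple[str, int, int]], max_entities: int = 4) -> List[float]:
    """Fixed-size numeric encoding: (type_id, x, y) per slot, zero-padded."""
    type_ids = {"agent": 1, "key": 2, "door": 3, "goal": 4}
    vec: List[float] = []
    for name, x, y in entities[:max_entities]:
        vec += [float(type_ids.get(name, 0)), float(x), float(y)]
    vec += [0.0] * (3 * max_entities - len(vec))   # pad unused slots
    return vec

def language_state(entities: List[Tuple[str, int, int]]) -> str:
    """Compositional text encoding: one clause per entity, any number of them."""
    return " ".join(f"There is a {name} at ({x}, {y})." for name, x, y in entities)

obs = [("agent", 0, 0), ("key", 2, 3), ("door", 5, 1)]
print(vector_state(obs))
print(language_state(obs))
```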

    Provably Efficient Adversarial Imitation Learning with Unknown Transitions

    Imitation learning (IL) has proven to be an effective method for learning good policies from expert demonstrations. Adversarial imitation learning (AIL), a subset of IL methods, is particularly promising, but its theoretical foundation in the presence of unknown transitions has yet to be fully developed. This paper explores the theoretical underpinnings of AIL in this context, where the stochastic and uncertain nature of environment transitions presents a challenge. We examine the expert sample complexity and interaction complexity required to recover good policies. To this end, we establish a framework connecting reward-free exploration and AIL, and propose an algorithm, MB-TAIL, that achieves the minimax optimal expert sample complexity of $\widetilde{O}(H^{3/2} |S| / \varepsilon)$ and interaction complexity of $\widetilde{O}(H^{3} |S|^2 |A| / \varepsilon^2)$. Here, $H$ is the planning horizon, $|S|$ is the state space size, $|A|$ is the action space size, and $\varepsilon$ is the desired imitation gap. MB-TAIL is the first algorithm to achieve this level of expert sample complexity in the unknown transition setting and improves upon the interaction complexity of the best-known algorithm, OAL, by $O(H)$. Additionally, we demonstrate the generalization ability of MB-TAIL by extending it to the function approximation setting and proving that it can achieve expert sample and interaction complexity independent of $|S|$.
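
    For orientation, a hedged restatement of the guarantees in standard notation; the paper's exact conventions (in particular how the imitation gap is defined) may differ.

```latex
% Hedged restatement; the paper's exact conventions may differ.
% The learned policy \hat{\pi} is an \varepsilon-good imitation of the expert \pi^{E} when
\[
  V(\pi^{E}) - V(\hat{\pi}) \;\le\; \varepsilon ,
\]
% and MB-TAIL attains this using
\[
  \widetilde{O}\!\left(\frac{H^{3/2}\,|S|}{\varepsilon}\right) \ \text{expert samples}
  \qquad\text{and}\qquad
  \widetilde{O}\!\left(\frac{H^{3}\,|S|^{2}\,|A|}{\varepsilon^{2}}\right) \ \text{environment interactions.}
\]
```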

    Safe Model-Based Multi-Agent Mean-Field Reinforcement Learning

    Many applications, e.g., in shared mobility, require coordinating a large number of agents. Mean-field reinforcement learning addresses the resulting scalability challenge by optimizing the policy of a representative agent. In this paper, we address an important generalization where there exist global constraints on the distribution of agents (e.g., capacity constraints or minimum coverage requirements). We propose Safe-$\text{M}^3$-UCRL, the first model-based algorithm that attains safe policies even in the case of unknown transition dynamics. As a key ingredient, it uses the epistemic uncertainty in the transition model within a log-barrier approach to ensure pessimistic constraint satisfaction with high probability. We showcase Safe-$\text{M}^3$-UCRL on the vehicle repositioning problem faced by many shared mobility operators and evaluate its performance through simulations built on Shenzhen taxi trajectory data. Our algorithm effectively meets the demand in critical areas while ensuring service accessibility in regions with low demand.
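
    A hedged sketch of the general log-barrier idea with pessimistic constraint evaluation; all names and numbers are hypothetical and the sketch is not the Safe-M^3-UCRL implementation.

```python
# Hedged sketch: distribution constraints are evaluated pessimistically by
# inflating the predicted constraint value with the model's epistemic
# uncertainty, and violations are discouraged via a log-barrier term.
import torch

def log_barrier_objective(expected_return, constraint_mean, constraint_std,
                          constraint_limit, beta=2.0, eta=0.1):
    """Return expected_return - eta * sum(-log(slack)), where slack is the
    pessimistic margin to the limit (mean + beta * std <= limit)."""
    pessimistic_value = constraint_mean + beta * constraint_std   # worst plausible case
    slack = constraint_limit - pessimistic_value
    if torch.any(slack <= 0):
        return torch.tensor(float("-inf"))    # pessimistically infeasible policy
    return expected_return - eta * torch.sum(-torch.log(slack))

# Example: capacity constraints on the mean-field distribution over 3 regions.
ret = torch.tensor(12.0)
mu = torch.tensor([0.30, 0.45, 0.25])     # predicted share of agents per region
sigma = torch.tensor([0.02, 0.03, 0.01])  # epistemic std of those predictions
cap = torch.tensor([0.40, 0.55, 0.35])    # capacity limit per region
print(log_barrier_objective(ret, mu, sigma, cap))
```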

    Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning

    While bisimulation-based approaches hold promise for learning robust state representations for Reinforcement Learning (RL) tasks, their efficacy in offline RL tasks has not been up to par; in some instances they have even significantly underperformed alternative methods. We aim to understand why bisimulation methods succeed in online settings but falter in offline tasks. Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation. We also shed light on the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we propose to apply the expectile operator for representation learning in our offline RL setting, which helps prevent overfitting to incomplete data. Meanwhile, by introducing an appropriate reward scaling strategy, we avoid the risk of feature collapse in the representation space. We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR, and demonstrate performance gains on two benchmark suites: D4RL and Visual D4RL. Code is available at https://github.com/zanghyu/Offline_Bisimulation.
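
    A minimal sketch of the expectile operator as an asymmetric squared loss (tau = 0.5 recovers the ordinary mean); how it is wired into the MICo/SimSR bisimulation targets is simplified here, and `reward_scale` is a generic knob standing in for the paper's reward scaling strategy.

```python
# Hedged sketch: expectile regression loss plus a generic bisimulation-style
# target with an explicit reward-scaling knob.
import torch

def expectile_loss(pred, target, tau=0.7):
    """Weight under-estimates by tau and over-estimates by 1 - tau, so the
    minimizer is the tau-expectile of the target rather than its mean."""
    diff = target - pred
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

def bisimulation_target(rewards_i, rewards_j, next_distance, gamma=0.99,
                        reward_scale=0.1):
    """Generic bisimulation-style target: scaled reward difference plus the
    discounted distance between next-state representations."""
    return reward_scale * (rewards_i - rewards_j).abs() + gamma * next_distance
```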