44 research outputs found

    Learning on a Budget Using Distributional RL

    Agents acting in real-world scenarios often have constraints such as finite budgets or daily job performance targets. While repeated (episodic) tasks can be solved with existing RL algorithms, methods need to be extended if the repetition depends on performance. Recent work has introduced a distributional perspective on reinforcement learning, providing a model of episodic returns. Inspired by these results, we contribute the new budget- and risk-aware distributional reinforcement learning (BRAD-RL) algorithm, which bootstraps from the C51 distributional output and then uses value iteration to estimate the value of starting an episode with a certain amount of budget. With this strategy we can make budget-wise action selections within each episode and maximize the return across episodes. Experiments in a grid-world domain highlight the benefits of our algorithm, which maximizes discounted future returns when low cumulative performance may terminate repetition.
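
    The budget-level value iteration mentioned in this abstract can be illustrated with a minimal sketch. It is not the BRAD-RL algorithm itself: it assumes integer budgets, a fixed discrete distribution over episodic returns (hypothetical `atoms` and `probs`, standing in for a C51-style output), and that repetition stops once the budget reaches zero. The function name and all parameters are assumptions made for illustration.

```python
def budget_values(atoms, probs, max_budget, gamma=0.9, sweeps=200):
    """Estimate the value of starting an episode with each budget level.

    Assumptions (not taken from the paper): budgets are integers in
    [0, max_budget], each episode's return r is drawn from the discrete
    distribution (atoms, probs), the budget moves from b to b + r, and
    no further episodes are played once the budget hits 0.
    """
    values = [0.0] * (max_budget + 1)  # values[0] stays 0: repetition has ended
    for _ in range(sweeps):
        new_values = [0.0] * (max_budget + 1)
        for b in range(1, max_budget + 1):
            total = 0.0
            for r, p in zip(atoms, probs):
                # Clip the next budget to the tabulated range.
                next_b = min(max(b + r, 0), max_budget)
                total += p * (r + gamma * values[next_b])
            new_values[b] = total
        values = new_values
    return values
```

    Under these assumptions, a low budget is worth less than a high one whenever negative returns are likely, because a bad episode can end the sequence early; that is the effect a budget-aware action choice would try to account for.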

    An exploration strategy for non-stationary opponents

    The success or failure of any learning algorithm is partially due to the exploration strategy it employs. However, most exploration strategies assume that the environment is stationary and non-strategic. In this work we shed light on how to design exploration strategies in non-stationary and adversarial environments. Our proposed adversarial drift exploration (DE) is able to efficiently explore the state space while keeping track of regions of the environment that have changed. This proposed exploration is general enough to be applied in single-agent non-stationary environments as well as in multiagent settings where the opponent changes its strategy over time. We use a two-agent strategic interaction setting to test this new type of exploration, where the opponent switches between different behavioral patterns to emulate a non-deterministic, stochastic and adversarial environment. The agent’s objective is to learn a model of the opponent’s strategy to act optimally. Our contribution is twofold. First, we present DE as a strategy for switch detection. Second, we propose a new algorithm called R-max# for learning and planning against non-stationary opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: (1) to maximize utilities in the short term while learning and (2) to eventually explore opponent behavioral changes. We provide theoretical results showing that R-max# is guaranteed to detect the opponent’s switch and learn a new model in terms of finite sample complexity. R-max# makes efficient use of exploration experiences, which results in rapid adaptation and efficient DE, to deal with the non-stationary nature of the opponent. We show experimentally how using DE outperforms the state-of-the-art algorithms that were explicitly designed for modeling opponents (in terms of average rewards) in two complementary domains.

    Efficiently detecting switches against non-stationary opponents

    Interactions in multiagent systems are generally more complicated than single-agent ones. Game theory provides solutions on how to act in multiagent scenarios; however, it assumes that all agents will act rationally. Moreover, some works also assume the opponent will use a stationary strategy. These assumptions usually do not hold in real-world scenarios where agents have limited capacities and may deviate from a perfectly rational response. Our goal is still to act optimally in these cases by learning the appropriate response and without any prior policies on how to act. Thus, we focus on the problem in which another agent in the environment uses different stationary strategies over time. This turns the problem into learning in a non-stationary environment, posing a problem for most learning algorithms. This paper introduces DriftER, an algorithm that (1) learns a model of the opponent, (2) uses that model to obtain an optimal policy and then (3) determines when it must re-learn due to an opponent strategy change. We provide theoretical results showing that DriftER is guaranteed to detect switches with high probability. We also provide empirical results showing that our approach outperforms state-of-the-art algorithms in normal-form games such as the prisoner’s dilemma and in a more realistic scenario, the Power TAC simulator.
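
    Step (3) of the loop described in this abstract, deciding when to re-learn, can be sketched as follows. The abstract does not say how DriftER detects a switch, so this is only an assumed stand-in: it tracks how often a learned opponent model mispredicts over a sliding window and flags a possible switch when the recent error rate exceeds a threshold. The class name, `window`, and `threshold` are hypothetical.

```python
from collections import deque


class SwitchDetector:
    """Flags a possible opponent strategy change from model prediction errors.

    Illustrative assumption only: a sliding-window error-rate test,
    not the detection rule used by DriftER.
    """

    def __init__(self, window=20, threshold=0.5):
        # Each entry is 1 if the model mispredicted that round, else 0.
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def update(self, predicted_action, observed_action):
        """Record one round; return True if the recent error rate is too high."""
        self.errors.append(int(predicted_action != observed_action))
        window_full = len(self.errors) == self.errors.maxlen
        error_rate = sum(self.errors) / len(self.errors)
        # Only raise a flag once a full window of evidence is available.
        return window_full and error_rate > self.threshold
```

    With a window of 10 and a threshold of 0.5, a model that keeps predicting the old action would be flagged on the sixth consecutive misprediction after a switch; a smaller window reacts faster but is more easily fooled by a stochastic opponent.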

    A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity

    The key challenge in multiagent learning is learning a best response to the behaviour of other agents, which may be non-stationary: if the other agents adapt their strategy as well, the learning target moves. Disparate streams of research have approached non-stationarity from several angles, which make a variety of implicit assumptions that make it hard to keep an overview of the state of the art and to validate the innovation and significance of new works. This survey presents a coherent overview of work that addresses opponent-induced non-stationarity with tools from game theory, reinforcement learning and multi-armed bandits. Further, we reflect on the principal approaches by which algorithms model and cope with this non-stationarity, arriving at a new framework and five categories (in increasing order of sophistication): ignore, forget, respond to target models, learn models, and theory of mind. A wide range of state-of-the-art algorithms is classified into a taxonomy, using these categories and key characteristics of the environment (e.g., observability) and adaptation behaviour of the opponents (e.g., smooth, abrupt). To clarify even further we present illustrative variations of one domain, contrasting the strengths and limitations of each category. Finally, we discuss in which environments the different approaches yield the most merit, and point to promising avenues of future research.

    Elective Cancer Surgery in COVID-19-Free Surgical Pathways During the SARS-CoV-2 Pandemic: An International, Multicenter, Comparative Cohort Study.

    PURPOSE: As cancer surgery restarts after the first COVID-19 wave, health care providers urgently require data to determine where elective surgery is best performed. This study aimed to determine whether COVID-19-free surgical pathways were associated with lower postoperative pulmonary complication rates compared with hospitals with no defined pathway. PATIENTS AND METHODS: This international, multicenter cohort study included patients who underwent elective surgery for 10 solid cancer types without preoperative suspicion of SARS-CoV-2. Participating hospitals included patients from local emergence of SARS-CoV-2 until April 19, 2020. At the time of surgery, hospitals were defined as having a COVID-19-free surgical pathway (complete segregation of the operating theater, critical care, and inpatient ward areas) or no defined pathway (incomplete or no segregation, areas shared with patients with COVID-19). The primary outcome was 30-day postoperative pulmonary complications (pneumonia, acute respiratory distress syndrome, unexpected ventilation). RESULTS: Of 9,171 patients from 447 hospitals in 55 countries, 2,481 were operated on in COVID-19-free surgical pathways. Patients who underwent surgery within COVID-19-free surgical pathways were younger with fewer comorbidities than those in hospitals with no defined pathway but with similar proportions of major surgery. After adjustment, pulmonary complication rates were lower with COVID-19-free surgical pathways (2.2% v 4.9%; adjusted odds ratio [aOR], 0.62; 95% CI, 0.44 to 0.86). This was consistent in sensitivity analyses for low-risk patients (American Society of Anesthesiologists grade 1/2), propensity score-matched models, and patients with negative SARS-CoV-2 preoperative tests. The postoperative SARS-CoV-2 infection rate was also lower in COVID-19-free surgical pathways (2.1% v 3.6%; aOR, 0.53; 95% CI, 0.36 to 0.76). 
    CONCLUSION: Within available resources, dedicated COVID-19-free surgical pathways should be established to provide safe elective cancer surgery during current and before future SARS-CoV-2 outbreaks.

    A global experiment on motivating social distancing during the COVID-19 pandemic

    Finding communication strategies that effectively motivate social distancing continues to be a global public health priority during the COVID-19 pandemic. This cross-country, preregistered experiment (n = 25,718 from 89 countries) tested hypotheses concerning generalizable positive and negative outcomes of social distancing messages that promoted personal agency and reflective choices (i.e., an autonomy-supportive message) or were restrictive and shaming (i.e., a controlling message) compared with no message at all. Results partially supported experimental hypotheses in that the controlling message increased controlled motivation (a poorly internalized form of motivation relying on shame, guilt, and fear of social consequences) relative to no message. On the other hand, the autonomy-supportive message lowered feelings of defiance compared with the controlling message, but the controlling message did not differ from receiving no message at all. Unexpectedly, messages did not influence autonomous motivation (a highly internalized form of motivation relying on one’s core values) or behavioral intentions. Results supported hypothesized associations between people’s existing autonomous and controlled motivations and self-reported behavioral intentions to engage in social distancing. Controlled motivation was associated with more defiance and less long-term behavioral intention to engage in social distancing, whereas autonomous motivation was associated with less defiance and more short- and long-term intentions to social distance. Overall, this work highlights the potential harm of using shaming and pressuring language in public health communication, with implications for the current and future global health challenges