42 research outputs found

    Offline Policy Evaluation and Optimization under Confounding

    Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but can also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on their time-evolution and effect on the data-collection policies. We determine when consistent value estimates are not achievable, providing and discussing algorithms to estimate lower bounds with guarantees in those cases. When consistent estimates are achievable, we provide sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on gridworld and a simulated healthcare setting of managing sepsis patients. We note that in gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.

    General synthesis of 2D rare-earth oxide single crystals with tailorable facets

    Two-dimensional (2D) rare-earth oxides (REOs) are a large family of materials with various intriguing applications, and precise facet control is essential for investigating new properties in the 2D limit. However, a bottleneck remains with regard to obtaining their 2D single crystals with specific facets because of the intrinsic non-layered structure and disparate thermodynamic stability of different facets. Herein, for the first time, we achieve the synthesis of a wide variety of high-quality 2D REO single crystals with tailorable facets via designing a hard-soft-acid-base couple for controlling the 2D nucleation of the predetermined facets and adjusting the growth mode and direction of crystals. Also, the facet-related magnetic properties of 2D REO single crystals were revealed. Our approach provides a foundation for further exploring other facet-dependent properties and various applications of 2D REOs, as well as inspiration for the precise growth of other non-layered 2D materials.

    31st Annual Meeting and Associated Programs of the Society for Immunotherapy of Cancer (SITC 2016) : part two

    Background: The immunological escape of tumors represents one of the main obstacles to the treatment of malignancies. The blockade of PD-1 or CTLA-4 receptors represented a milestone in the history of immunotherapy. However, immune checkpoint inhibitors seem to be effective in specific cohorts of patients. It has been proposed that their efficacy relies on the presence of an immunological response. Thus, we hypothesized that disruption of the PD-L1/PD-1 axis would synergize with our oncolytic vaccine platform PeptiCRAd. Methods: We used murine B16OVA in vivo tumor models and flow cytometry analysis to investigate the immunological background. Results: First, we found that high-burden B16OVA tumors were refractory to combination immunotherapy. However, with a more aggressive schedule, tumors with a lower burden were more susceptible to the combination of PeptiCRAd and PD-L1 blockade. The therapy significantly increased the median survival of mice (Fig. 7). Interestingly, the reduced growth of contralaterally injected B16F10 cells suggested the presence of a long-lasting immunological memory also against non-targeted antigens. Concerning the functional state of tumor-infiltrating lymphocytes (TILs), we found that all the immune therapies enhanced the percentage of activated (PD-1pos TIM-3neg) T lymphocytes and reduced the amount of exhausted (PD-1pos TIM-3pos) cells compared to placebo. As expected, we found that PeptiCRAd monotherapy could increase the number of antigen-specific CD8+ T cells compared to other treatments. However, only the combination with PD-L1 blockade could significantly increase the ratio between activated and exhausted pentamer-positive cells (p = 0.0058), suggesting that by disrupting the PD-1/PD-L1 axis we could decrease the amount of dysfunctional antigen-specific T cells. We observed that the anatomical location deeply influenced the state of CD4+ and CD8+ T lymphocytes. In fact, TIM-3 expression was increased 2-fold on TILs compared to splenic and lymphoid T cells. In the CD8+ compartment, the expression of PD-1 on the surface seemed to be restricted to the tumor microenvironment, while CD4+ T cells had a high expression of PD-1 also in lymphoid organs. Interestingly, we found that the levels of PD-1 were significantly higher on CD8+ T cells than on CD4+ T cells in the tumor microenvironment (p < 0.0001). Conclusions: We demonstrated that the efficacy of immune checkpoint inhibitors might be strongly enhanced by their combination with cancer vaccines. PeptiCRAd was able to increase the number of antigen-specific T cells and PD-L1 blockade prevented their exhaustion, resulting in long-lasting immunological memory and increased median survival.

    Advances in Sequential Decision Making Problems with Causal and Low-Rank Structures

    Bandits and Markov Decision Processes are powerful sequential decision making paradigms that have been widely applied to solve real world problems. However, existing algorithms often suffer from high sample complexity due to the large action space. In this thesis, we present several contributions to reduce the sample complexity by exploiting the problem structure. In the first part, we study how to utilize the given causal information represented as a causal graph along with associated conditional distributions for bandit problems. We propose two algorithms, causal upper confidence bound (C-UCB) and causal Thompson Sampling (C-TS), that enjoy improved cumulative regret bounds compared with algorithms that do not use causal information. Further, we extend C-UCB and C-TS to the linear bandit setting. We also show that under certain causal structures, our algorithms scale better than the standard bandit algorithms as the number of interventions increases. In the second part, we further explore how to utilize the given causal information for Markov Decision Processes. We introduce causal Markov Decision Processes, a new formalism for sequential decision making which combines the standard Markov Decision Process formulation with causal structures over state transition and reward functions. We propose the causal upper confidence bound value iteration (C-UCBVI) algorithm that exploits the causal structure and improves the performance of standard reinforcement learning algorithms that do not take causal knowledge into account. To tackle the large state space problem in Markov Decision Processes, we further formulate causal factored Markov Decision Processes and design new algorithms with reduced regret. Lastly, we explore the connection between linear Markov Decision Processes and causal Markov Decision Processes. In the third part, we tackle the challenging setting where the causal information is unknown. We propose mild identifiability conditions and design new causal bandit algorithms for causal trees, causal forests and a general class of causal graphs. We prove that the regret guarantees of our algorithms greatly improve upon those of standard multi-armed bandit algorithms. Lastly, we prove our mild conditions are necessary: without them one cannot do better than standard bandit algorithms. In the fourth part, we investigate a challenging problem associated with the causal structure: unobserved confounders. We study to what extent the unobserved confounders affect the estimation in the offline policy evaluation problem in reinforcement learning. We give the first minimax lower bound for the error due to unobserved confounders. We also analyze two algorithms and show they are minimax optimal. Lastly, we propose a new model-based method and show it is never worse than the model-free method proposed in prior work. In the last part, we explore another problem structure, the low-rank property of the ground truth parameter. We study linear bandits and generalized linear bandits, and we present algorithms via a novel combination of online-to-confidence-set conversion and the exponentially weighted average forecaster constructed by a covering of low-rank matrices. To get around the computational intractability of covering based approaches, we propose an efficient algorithm using the subspace exploration technique. Our theoretical and empirical results demonstrate the effectiveness of utilizing the low-rank structures in reducing the regret.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/174494/1/yylu_1.pd