
    Client Selection for Federated Policy Optimization with Environment Heterogeneity

    The development of Policy Iteration (PI) has inspired many recent algorithms for Reinforcement Learning (RL), including several policy gradient methods that have achieved both theoretical soundness and empirical success on a variety of tasks. The theory of PI is rich in the context of centralized learning, but its study in the federated setting is still in its infancy. This paper investigates the federated version of Approximate PI (API) and derives its error bound, taking into account the approximation error introduced by environment heterogeneity. We theoretically prove that a proper client selection scheme can reduce this error bound. Based on this theoretical result, we propose a client selection algorithm to alleviate the additional approximation error caused by environment heterogeneity. Experimental results show that the proposed algorithm outperforms other biased and unbiased client selection methods on the federated mountain car problem and the MuJoCo Hopper problem by effectively selecting clients with a lower level of heterogeneity from the population distribution.
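
    The paper's actual selection scheme is not reproduced here; the sketch below only illustrates the general idea of heterogeneity-aware client selection, assuming each client's environment can be summarized by a statistics vector and scored by its distance to the population mean. The function names and the L2 distance are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def heterogeneity_score(client_stats, population_stats):
    """Distance between a client's environment statistics and the population
    average; smaller means less heterogeneous (assumption: an L2 distance over
    estimated transition/reward statistics)."""
    return np.linalg.norm(client_stats - population_stats)

def select_clients(client_stats, k):
    """Pick the k clients whose environments look closest to the population
    average, i.e. a biased, low-heterogeneity selection."""
    population_stats = client_stats.mean(axis=0)
    scores = [heterogeneity_score(s, population_stats) for s in client_stats]
    return np.argsort(scores)[:k]  # indices of the k least heterogeneous clients

# Toy example: 10 clients, each summarized by a 4-dimensional statistics vector.
rng = np.random.default_rng(0)
print(select_clients(rng.normal(size=(10, 4)), k=3))
```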

    FedKL: Tackling Data Heterogeneity in Federated Reinforcement Learning by Penalizing KL Divergence

    As a distributed learning paradigm, Federated Learning (FL) faces a communication bottleneck due to the many rounds of model synchronization and aggregation. Heterogeneous data further deteriorates the situation by causing slow convergence. Although the impact of data heterogeneity on supervised FL has been widely studied, the related investigation for Federated Reinforcement Learning (FRL) is still in its infancy. In this paper, we first define the type and level of data heterogeneity for policy-gradient-based FRL systems. By inspecting the connection between the global and local objective functions, we prove that local training can benefit the global objective if the local update is properly penalized by the total variation (TV) distance between the local and global policies. A necessary condition for the global policy to be learnable from the local policy is also derived, which is directly related to the heterogeneity level. Based on the theoretical result, a Kullback-Leibler (KL) divergence based penalty is proposed, which, unlike the conventional method that penalizes model divergence in the parameter space, directly constrains the model outputs in the distribution space. By jointly penalizing the divergence of the local policy from the global policy with a global penalty and constraining each iteration of the local training with a local penalty, the proposed method achieves a better trade-off between training speed (step size) and convergence. Experimental results on two popular RL experiment platforms demonstrate the advantage of the proposed algorithm over existing methods in accelerating and stabilizing the training process with heterogeneous data.
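
    As a rough illustration of the kind of penalty the abstract describes (constraining the local policy toward the global policy in distribution space, plus a per-iteration local constraint), the snippet below adds two KL terms to a policy-gradient loss. The discrete-action setting, the loss form, and the coefficients are assumptions for illustration, not FedKL's exact formulation.

```python
import torch
import torch.nn.functional as F

def local_loss(local_logits, global_logits, actions, advantages,
               beta_global=0.1, beta_local=0.01, old_local_logits=None):
    """Policy-gradient loss with KL penalties (discrete actions assumed).

    beta_global * KL(local || global) keeps the local policy close to the
    global policy; beta_local * KL(local || previous local) limits the size
    of each local update step. All coefficients are illustrative."""
    log_probs = F.log_softmax(local_logits, dim=-1)
    pg_loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1) * advantages).mean()

    # F.kl_div(input_log_probs, target_probs) computes KL(target || input).
    kl_global = F.kl_div(F.log_softmax(global_logits, dim=-1),
                         log_probs.exp(), reduction="batchmean")
    loss = pg_loss + beta_global * kl_global

    if old_local_logits is not None:
        kl_local = F.kl_div(F.log_softmax(old_local_logits, dim=-1),
                            log_probs.exp(), reduction="batchmean")
        loss = loss + beta_local * kl_local
    return loss
```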

    Decentralized and Dynamic Home Health Care Resource Scheduling Using an Agent-Based Model

    The purpose of this thesis is to design an agent-based scheduling system, simulated in a dynamic environment, that reduces home healthcare service costs. The study focuses on situations where a healthcare agency needs to assign home visits among a group of independent healthcare practitioners. Each practitioner has different skill sets, time constraints, and cost structures, given the nature, time, and location of each home visit. Each expects reasonable payment commensurate with their skill level as well as the costs incurred. The healthcare agency in turn needs all planned visits performed by qualified practitioners while minimizing overall service costs. Decisions about scheduling are made both before and during the scheduling period, requiring the healthcare agency to respond to unexpected situations based on the latest scheduling information. This problem is examined in a multi-agent system environment where practitioners are modeled as self-interested agents. The study first analyzes the problem for insights into its combinatorial nature in a centralized environment, then discusses the decentralized and dynamic challenges. An iterated bidding mechanism is designed as the negotiation protocol for the system. The effectiveness of this system is evaluated through a computational study, with results showing that the proposed multi-agent scheduling system is able to compute high-quality schedules in the decentralized home healthcare environment. Following this, the system is also implemented in a simulation model that can accommodate unexpected situations. We present different simulation scenarios that illustrate how the system dynamically schedules incoming visits, and cost reductions can be observed from the results.
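
    The thesis specifies the full negotiation protocol; the following is only a schematic single assignment round under assumed data structures (visit and practitioner dictionaries and a bid function), meant to illustrate the iterated-bidding idea of practitioners quoting costs and the agency awarding each visit to the cheapest qualified offer.

```python
def assign_visits(visits, practitioners, get_bid):
    """One bidding round: each qualified practitioner quotes a cost for each
    visit, and the agency tentatively awards the visit to the lowest bid.
    In the full iterated mechanism this repeats, with bids revised each round."""
    schedule = {}
    for visit in visits:
        bids = [(get_bid(p, visit), p) for p in practitioners
                if visit["skill"] in p["skills"]]  # qualification check
        if bids:
            cost, winner = min(bids, key=lambda b: b[0])
            schedule[visit["id"]] = (winner["name"], cost)
    return schedule

# Illustrative data: two practitioners with different skills, two visits.
practitioners = [{"name": "A", "skills": {"nursing"}},
                 {"name": "B", "skills": {"physio", "nursing"}}]
visits = [{"id": 1, "skill": "nursing"}, {"id": 2, "skill": "physio"}]
get_bid = lambda p, v: 10 + 5 * len(p["skills"])  # toy cost model
print(assign_visits(visits, practitioners, get_bid))
```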

    Large Ecosystem Service Benefits of Assisted Natural Regeneration

    China manages the largest monoculture plantations in the world, with 24% being Chinese fir plantations. Maximizing the ecosystem services of Chinese fir plantations has important implications for the global carbon cycle and biodiversity protection. Assisted natural regeneration (ANR) is a practice that converts degraded lands into more productive forests with greater ecosystem services. However, quantitative understanding of ANR ecosystem service benefits is very limited. We conducted a comprehensive field manipulation experiment to evaluate the potential of ANR. We quantified and compared key ecosystem services, including surface runoff, sediment yield, dissolved organic carbon export, plant diversity, and aboveground carbon accumulation, of ANR of secondary forests dominated by Castanopsis carlesii against those of Chinese fir and C. carlesii plantations. Our results showed that ANR of C. carlesii forest reduced surface runoff and sediment yield by up to 50% compared with other young plantations in the first 3 years and substantially increased plant diversity. ANR also reduced the export of dissolved organic carbon by 60–90% in the first 2 years. Aboveground biomass of the young ANR forest was approximately 3–4 times that of other young plantations, while aboveground biomass of mature ANR forests was approximately 1.4 times that of mature Chinese fir plantations of the same age. If all Chinese fir plantations in China were replaced by ANR forests, approximately 0.7 Pg more carbon could potentially be stored aboveground in one rotation (25 years). The results indicate that ANR triggers positive feedbacks among soil and water conservation, biodiversity protection, and biomass accumulation, and thereby enhances ecosystem services.

    Multi-granularity Item-based Contrastive Recommendation

    Contrastive learning (CL) has shown its power in recommendation. However, most CL-based recommendation models build their CL tasks focusing merely on the user side, ignoring the rich and diverse information in items. In this work, we propose a novel Multi-granularity item-based contrastive learning (MicRec) framework for the matching stage (i.e., candidate generation) in recommendation, which systematically introduces multi-aspect item-related information into representation learning with CL. Specifically, we build three item-based CL tasks as a set of plug-and-play auxiliary objectives to capture item correlations at the feature, semantic, and session levels. The feature-level item CL aims to learn fine-grained feature-level item correlations via items and their augmentations. The semantic-level item CL focuses on coarse-grained semantic correlations between semantically related items. The session-level item CL highlights the global behavioral correlations of items drawn from users' sequential behaviors across all sessions. In experiments, we conduct both offline and online evaluations on real-world datasets, verifying the effectiveness and universality of the three proposed CL tasks. Currently, MicRec has been deployed in a real-world recommender system, affecting millions of users. The source code will be released in the future. (Comment: 17 pages, under review.)
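
    As a point of reference for the three CL tasks, the snippet below shows a generic in-batch InfoNCE objective over item embeddings; MicRec's actual losses, augmentations, and positive-sampling rules are not given in the abstract, so the details here are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """In-batch InfoNCE: each anchor item embedding should match its own
    positive against the other items in the batch. This is a generic
    contrastive objective, not MicRec's exact loss."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# The three granularities would reuse the same loss with different positives:
# feature-level (item vs. its feature augmentation), semantic-level (item vs.
# a semantically related item), session-level (item vs. a co-session item).
item_emb = torch.randn(8, 64)
aug_emb = item_emb + 0.05 * torch.randn(8, 64)    # toy augmentation
loss = info_nce(item_emb, aug_emb)
```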

    2,2,4,4-Tetraphenyl-1,3-bis(3,3,5,5-tetramethyl-1,1-diphenyl-5-vinyltrisiloxan-1-yl)cyclodisilazane

    The title molecule, C60H70N2O4Si8, lies on an inversion center. In the asymmetric unit, one of the phenyl rings is disordered over two sets of sites with refined occupancies of 0.58 (2) and 0.42 (2). In addition, in two substitution sites of the terminal dimethyl(vinyl)silyl unit, a methyl group and the vinyl group are disordered over the same site with refined occupancies of 0.523 (13) and 0.477 (13).

    3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

    Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples. Despite its efficiency, DPO has rarely been used in state-of-the-art production-level LLMs, implying potential pathologies. In this work, we revisit DPO with a comprehensive examination of its empirical efficacy and a systematic comparison with RLHF-PPO. We identify the 3D properties of DPO's learning outcomes: the Drastic drop in the likelihood of rejected responses, the Degradation into LLM unlearning, and the Dispersion effect on unseen responses, through experiments with both a carefully designed toy model and practical LLMs on tasks including mathematical problem solving and instruction following. These findings inherently connect to observations made in related works, and we additionally contribute a plausible theoretical explanation for them. Accordingly, we propose easy regularization methods to mitigate the issues caused by the 3D properties, improving the training stability and final performance of DPO. Our contributions also include an investigation into how the distribution of the paired preference data impacts the effectiveness of DPO. We hope this work can offer research directions to narrow the gap between reward-free preference learning methods and reward-based ones.
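
    For context, the sketch below shows the standard DPO objective the abstract builds on, with one simple optional regularizer (an SFT-style term on the chosen responses to counteract likelihood collapse). The regularizer and its weight are illustrative assumptions, not the paper's proposed method.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1, sft_weight=0.0):
    """Standard DPO loss over summed log-probs of chosen/rejected responses,
    with an optional SFT-style term on the chosen responses as one simple
    regularizer (an illustrative choice, not the paper's exact remedy)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    if sft_weight > 0:
        # Minimizing this term pushes the chosen-response likelihood back up.
        loss = loss - sft_weight * logp_chosen.mean()
    return loss
```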