
    Sparse temporal difference learning via alternating direction method of multipliers

    Recent work in off-line Reinforcement Learning has focused on efficient algorithms to incorporate feature selection, via ℓ1-regularization, into the Bellman operator fixed-point estimators. These developments now mean that over-fitting can be avoided when the number of samples is small compared to the number of features. However, it remains unclear whether existing algorithms have the ability to offer good approximations for the task of policy evaluation and improvement. In this paper, we propose a new algorithm for approximating the fixed point based on the Alternating Direction Method of Multipliers (ADMM). We demonstrate, with experimental results, that the proposed algorithm is more stable for policy iteration compared to prior work. Furthermore, we also derive a theoretical result that states the proposed algorithm obtains a solution which satisfies the optimality conditions for the fixed-point problem.
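    As a rough illustration of the approach described above (not the paper's exact formulation), the sketch below applies ADMM to an ℓ1-regularized least-squares version of the LSTD fixed point; the objective, variable names, and hyper-parameters are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Element-wise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def sparse_lstd_admm(phi, phi_next, rewards, gamma=0.99, lam=0.1, rho=1.0, n_iters=200):
    """Illustrative ADMM solver for an l1-regularized LSTD-style fixed point.

    Solves min_w 0.5 * ||A w - b||^2 + lam * ||w||_1, where
    A = Phi^T (Phi - gamma * Phi') and b = Phi^T r are the usual LSTD quantities.
    phi, phi_next: (n_samples, n_features) feature matrices; rewards: (n_samples,).
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    d = A.shape[1]
    w, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    lhs = A.T @ A + rho * np.eye(d)        # reused by every w-update
    rhs_const = A.T @ b
    for _ in range(n_iters):
        w = np.linalg.solve(lhs, rhs_const + rho * (z - u))  # quadratic step
        z = soft_threshold(w + u, lam / rho)                 # sparsity step
        u = u + w - z                                        # scaled dual update
    return z  # sparse weight vector
```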

    Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning

    Oftentimes, environments for sequential decision-making problems can be quite sparse in the provision of evaluative feedback to guide reinforcement-learning agents. In the extreme case, long trajectories of behavior are merely punctuated with a single terminal feedback signal, engendering a significant temporal delay between the observation of non-trivial reward and the individual steps of behavior culpable for eliciting such feedback. Coping with such a credit assignment challenge is one of the hallmark characteristics of reinforcement learning and, in this work, we capitalize on existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the handling of credit assignment with policy-gradient methods. While the use of so-called hindsight policies offers a principled mechanism for reweighting on-policy data by saliency to the observed trajectory return, naively applying importance sampling results in unstable or excessively lagged learning. In contrast, our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
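    To make the reweighting idea concrete, the sketch below shows a return-conditioned hindsight correction applied to a policy-gradient loss, following the general hindsight-policy formulation; the paper's DICE-style ratio estimator and stabilization details are not reproduced, and all names here are illustrative.

```python
import torch

def hindsight_weighted_pg_loss(logp_actions, hindsight_logp, returns):
    """Illustrative return-conditioned hindsight credit assignment.

    logp_actions:   log pi(a_t | s_t) under the current policy, shape [T]
    hindsight_logp: log h(a_t | s_t, Z) under a learned hindsight policy
                    conditioned on the trajectory return Z, shape [T]
    returns:        the trajectory return broadcast to every step, shape [T]

    The per-step advantage (1 - pi/h) * Z follows the hindsight-policy
    formulation; the paper's DICE-based distribution correction is not shown.
    """
    ratio = torch.exp(logp_actions - hindsight_logp).detach()  # pi / h, no gradient
    advantage = (1.0 - ratio) * returns
    return -(logp_actions * advantage).mean()
```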

    Cloud Index Tracking: Enabling Predictable Costs in Cloud Spot Markets

    Cloud spot markets rent VMs for a variable price that is typically much lower than the price of on-demand VMs, which makes them attractive for a wide range of large-scale applications. However, applications that run on spot VMs suffer from cost uncertainty, since spot prices fluctuate, in part, based on supply, demand, or both. The difficulty in predicting spot prices affects users and applications: the former cannot effectively plan their IT expenditures, while the latter cannot infer the availability and performance of spot VMs, which are a function of their variable price. To address the problem, we use properties of cloud infrastructure and workloads to show that prices become more stable and predictable as they are aggregated together. We leverage this observation to define an aggregate index price for spot VMs that serves as a reference for what users should expect to pay. We show that, even when the spot prices for individual VMs are volatile, the index price remains stable and predictable. We then introduce cloud index tracking: a migration policy that tracks the index price to ensure applications running on spot VMs incur a predictable cost by migrating to a new spot VM if the current VM's price significantly deviates from the index price.
    Comment: ACM Symposium on Cloud Computing 201
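    A minimal sketch of the migration rule described above might look like the following; the use of a simple mean as the index price and the 15% deviation threshold are assumptions of this sketch, not the paper's exact construction.

```python
def should_migrate(current_price, candidate_prices, deviation_threshold=0.15):
    """Illustrative index-tracking migration check.

    The index price is taken here to be the mean of current spot prices
    across the candidate VM pools; the aggregation scheme and the threshold
    value are assumptions of this sketch.
    """
    index_price = sum(candidate_prices) / len(candidate_prices)
    return (current_price - index_price) / index_price > deviation_threshold
```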

    United Nations Development Assistance Framework for Kenya

    The United Nations Development Assistance Framework (2014-2018) for Kenya is an expression of the UN's commitment to support the Kenyan people in their self-articulated development aspirations. This UNDAF has been developed according to the principles of UN Delivering as One (DaO), aimed at ensuring Government ownership, demonstrated through UNDAF's full alignment to Government priorities and planning cycles, as well as internal coherence among UN agencies and programmes operating in Kenya. The UNDAF narrative includes five recommended sections: Introduction and Country Context, UNDAF Results, Resource Estimates, Implementation Arrangements, and Monitoring and Evaluation as well as a Results and Resources Annex. Developed under the leadership of the Government, the UNDAF reflects the efforts of all UN agencies working in Kenya and is shaped by the five UNDG programming principles: Human Rights-based approach, gender equality, environmental sustainability, capacity development, and results-based management. The UNDAF working groups have developed a truly broad-based Results Framework, in collaboration with Civil Society, donors and other partners. The UNDAF has four Strategic Results Areas: 1) Transformational Governance encompassing Policy and Institutional Frameworks; Democratic Participation and Human Rights; Devolution and Accountability; and Evidence-based Decision-making, 2) Human Capital Development comprised of Education and Learning; Health, including Water, Sanitation and Hygiene (WASH), Environmental Preservation, Food Availability and Nutrition; Multi-sectoral HIV and AIDS Response; and Social Protection, 3) Inclusive and Sustainable Economic Growth, with Improving the Business Environment; Strengthening Productive Sectors and Trade; and Promoting Job Creation, Skills Development and Improved Working Conditions, and 4) Environmental Sustainability, Land Management and Human Security including Policy and Legal Framework Development; and Peace, Community Security and Resilience. The UNDAF Results Areas are aligned with the three Pillars (Political, Social and Economic) of the Government's Vision 2030 transformational agenda.

    CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

    Off-policy temporal difference (TD) methods are a powerful class of reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD algorithms are not commonly used in combination with feature normalization techniques, despite positive effects of normalization in other domains. We show that naive application of existing normalization techniques is indeed not effective, but that well-designed normalization improves optimization stability and removes the necessity of target networks. In particular, we introduce a normalization based on a mixture of on- and off-policy transitions, which we call cross-normalization. It can be regarded as an extension of batch normalization that re-centers data for two different distributions, as present in off-policy learning. Applied to DDPG and TD3, cross-normalization improves over the state of the art across a range of MuJoCo benchmark tasks.
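    As a rough sketch of the idea, cross-normalization can be thought of as computing batch-normalization statistics over a mixture of on-policy and off-policy features; the equal-weight concatenation and placement below are assumptions of this sketch rather than the paper's exact design.

```python
import torch

def cross_normalize(on_policy_feats, off_policy_feats, eps=1e-5):
    """Illustrative cross-normalization of critic features.

    Normalization statistics are computed over a mixture of on-policy and
    off-policy batches; the 50/50 concatenation here is an assumption of
    this sketch rather than the paper's exact mixing scheme.
    """
    mixed = torch.cat([on_policy_feats, off_policy_feats], dim=0)
    mean = mixed.mean(dim=0, keepdim=True)
    std = mixed.var(dim=0, unbiased=False, keepdim=True).add(eps).sqrt()
    return (on_policy_feats - mean) / std, (off_policy_feats - mean) / std
```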

    TextGAIL: Generative Adversarial Imitation Learning for Text Generation

    Generative Adversarial Networks (GANs) for text generation have recently received many criticisms, as they perform worse than their MLE counterparts. We suspect previous text GANs' inferior performance is due to the lack of a reliable guiding signal in their discriminators. To address this problem, we propose a generative adversarial imitation learning framework for text generation that uses large pre-trained language models to provide more reliable reward guidance. Our approach uses a contrastive discriminator and proximal policy optimization (PPO) to stabilize and improve text generation performance. For evaluation, we conduct experiments on a diverse set of unconditional and conditional text generation tasks. Experimental results show that TextGAIL achieves better performance in terms of both quality and diversity than the MLE baseline. We also validate our intuition that TextGAIL's discriminator demonstrates the capability of providing reasonable rewards with an additional task.
    Comment: AAAI 202
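    For illustration, the sketch below shows a PPO-style clipped objective driven by per-sequence discriminator scores used as rewards; the contrastive discriminator and the pretrained-LM reward guidance from the paper are not reproduced, and all names are illustrative.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Illustrative PPO clipped objective for text generation.

    logp_new / logp_old: sequence log-probabilities under the current and
    rollout policies; rewards: per-sequence scores from a discriminator.
    The contrastive discriminator and pretrained-LM guidance used in the
    paper are not reproduced here.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * rewards
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * rewards
    return -torch.min(unclipped, clipped).mean()
```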