Sparse temporal difference learning via alternating direction method of multipliers
Recent work in offline reinforcement learning has focused on efficient algorithms that incorporate feature selection, via ℓ1-regularization, into Bellman-operator fixed-point estimators. These developments mean that over-fitting can now be avoided when the number of samples is small compared to the number of features. However, it remains unclear whether existing algorithms offer good approximations for the tasks of policy evaluation and improvement. In this paper, we propose a new algorithm for approximating the fixed point based on the Alternating Direction Method of Multipliers (ADMM). We demonstrate, with experimental results, that the proposed algorithm is more stable for policy iteration than prior work. Furthermore, we derive a theoretical result stating that the proposed algorithm obtains a solution which satisfies the optimality conditions for the fixed-point problem.
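The abstract does not include the algorithm itself. As a rough, hedged illustration of the general technique it names, below is a minimal ADMM sketch for an ℓ1-regularized least-squares problem, the LASSO-style objective that such regularized fixed-point estimators typically reduce to. All names here are hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of the l1 norm: shrink each entry toward zero by k."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 via ADMM.

    Splits the problem into a ridge-like x-update, a soft-thresholding
    z-update, and a dual (u) ascent step, alternating until convergence.
    """
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    AtA = A.T @ A
    Atb = A.T @ b
    # Cache the linear-system factor; it is fixed across iterations.
    L = np.linalg.inv(AtA + rho * np.eye(n))
    for _ in range(iters):
        x = L @ (Atb + rho * (z - u))          # x-update: ridge solve
        z = soft_threshold(x + u, lam / rho)   # z-update: prox of lam*||.||_1
        u = u + x - z                          # dual update
    return z
```

On a small synthetic problem with a sparse ground truth, the returned vector is sparse and close to the generator, which is the behavior the feature-selection use case relies on.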
Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning
Oftentimes, environments for sequential decision-making problems can be quite
sparse in the provision of evaluative feedback to guide reinforcement-learning
agents. In the extreme case, long trajectories of behavior are merely
punctuated with a single terminal feedback signal, engendering a significant
temporal delay between the observation of non-trivial reward and the individual
steps of behavior culpable for eliciting such feedback. Coping with such a
credit assignment challenge is one of the hallmark characteristics of
reinforcement learning and, in this work, we capitalize on existing
importance-sampling ratio estimation techniques for off-policy evaluation to
drastically improve the handling of credit assignment with policy-gradient
methods. While the use of so-called hindsight policies offers a principled
mechanism for reweighting on-policy data by saliency to the observed trajectory
return, naively applying importance sampling results in unstable or excessively
lagged learning. In contrast, our hindsight distribution correction facilitates
stable, efficient learning across a broad range of environments where credit
assignment plagues baseline methods.
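The instability the abstract attributes to naive importance sampling comes from unbounded likelihood ratios. A minimal sketch of the generic mechanism, with hypothetical names and a simple clipping heuristic standing in for the paper's distribution correction:

```python
import numpy as np

def hindsight_weights(logp_policy, logp_hindsight, clip=10.0):
    """Per-step importance ratios pi(a|s) / h(a|s, return).

    logp_policy and logp_hindsight are log-probabilities of the taken
    actions under the behavior policy and a return-conditioned hindsight
    policy. Clipping bounds the ratios, taming the variance that makes
    naive importance sampling unstable.
    """
    ratios = np.exp(np.asarray(logp_policy) - np.asarray(logp_hindsight))
    return np.minimum(ratios, clip)
```

Steps whose actions are much more likely in hindsight get down-weighted; steps the hindsight policy finds salient get up-weighted, up to the clip.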
Cloud Index Tracking: Enabling Predictable Costs in Cloud Spot Markets
Cloud spot markets rent VMs for a variable price that is typically much lower
than the price of on-demand VMs, which makes them attractive for a wide range
of large-scale applications. However, applications that run on spot VMs suffer
from cost uncertainty, since spot prices fluctuate, in part, based on supply,
demand, or both. The difficulty in predicting spot prices affects users and
applications: the former cannot effectively plan their IT expenditures, while
the latter cannot infer the availability and performance of spot VMs, which are
a function of their variable price. To address the problem, we use properties
of cloud infrastructure and workloads to show that prices become more stable
and predictable as they are aggregated together. We leverage this observation
to define an aggregate index price for spot VMs that serves as a reference for
what users should expect to pay. We show that, even when the spot prices for
individual VMs are volatile, the index price remains stable and predictable. We
then introduce cloud index tracking: a migration policy that tracks the index
price to ensure applications running on spot VMs incur a predictable cost by
migrating to a new spot VM if the current VM's price significantly deviates
from the index price.
Comment: ACM Symposium on Cloud Computing 201
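The two ingredients the abstract describes, an aggregate index price and a deviation-triggered migration rule, can be sketched in a few lines. The functions and the threshold value are illustrative assumptions, not the paper's actual policy:

```python
def index_price(spot_prices):
    """Aggregate index: the average spot price over a pool of markets.

    Averaging smooths out the volatility of individual VM prices, which
    is what makes the index stable and predictable.
    """
    return sum(spot_prices) / len(spot_prices)

def should_migrate(current_price, index, threshold=0.2):
    """Migrate when the current VM's price exceeds the index by more
    than `threshold` as a fraction (20% here, a hypothetical setting)."""
    return current_price > index * (1.0 + threshold)
```

An application tracking the index would poll its current VM's price, compare it to the index, and move to a cheaper spot VM only when the deviation crosses the threshold, keeping total cost near the index.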
United Nations Development Assistance Framework for Kenya
The United Nations Development Assistance Framework (2014-2018) for Kenya is an expression of the UN's commitment to support the Kenyan people in their self-articulated development aspirations. This UNDAF has been developed according to the principles of UN Delivering as One (DaO), aimed at ensuring Government ownership, demonstrated through UNDAF's full alignment to Government priorities and planning cycles, as well as internal coherence among UN agencies and programmes operating in Kenya. The UNDAF narrative includes five recommended sections: Introduction and Country Context, UNDAF Results, Resource Estimates, Implementation Arrangements, and Monitoring and Evaluation as well as a Results and Resources Annex. Developed under the leadership of the Government, the UNDAF reflects the efforts of all UN agencies working in Kenya and is shaped by the five UNDG programming principles: Human Rights-based approach, gender equality, environmental sustainability, capacity development, and results based management. The UNDAF working groups have developed a truly broad-based Results Framework, in collaboration with Civil Society, donors and other partners. 
The UNDAF has four Strategic Results Areas: 1) Transformational Governance encompassing Policy and Institutional Frameworks; Democratic Participation and Human Rights; Devolution and Accountability; and Evidence-based Decision-making, 2) Human Capital Development comprised of Education and Learning; Health, including Water, Sanitation and Hygiene (WASH), Environmental Preservation, Food Availability and Nutrition; Multi-sectoral HIV and AIDS Response; and Social Protection, 3) Inclusive and Sustainable Economic Growth, with Improving the Business Environment; Strengthening Productive Sectors and Trade; and Promoting Job Creation, Skills Development and Improved Working Conditions, and 4) Environmental Sustainability, Land Management and Human Security including Policy and Legal Framework Development; and Peace, Community Security and Resilience. The UNDAF Results Areas are aligned with the three Pillars (Political, Social and Economic) of the Government's Vision 2030 transformational agenda.
CrossNorm: Normalization for Off-Policy TD Reinforcement Learning
Off-policy temporal difference (TD) methods are a powerful class of
reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD
algorithms are not commonly used in combination with feature normalization
techniques, despite positive effects of normalization in other domains. We show
that naive application of existing normalization techniques is indeed not
effective, but that well-designed normalization improves optimization stability
and removes the necessity of target networks. In particular, we introduce a
normalization based on a mixture of on- and off-policy transitions, which we
call cross-normalization. It can be regarded as an extension of batch
normalization that re-centers data for two different distributions, as present
in off-policy learning. Applied to DDPG and TD3, cross-normalization improves
over the state of the art across a range of MuJoCo benchmark tasks.
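The core idea, batch-norm-style re-centering computed over a mixture of on- and off-policy transitions rather than a single batch, can be sketched as follows. This is a minimal illustration under assumed names, not the paper's implementation:

```python
import numpy as np

def cross_normalize(on_batch, off_batch, eps=1e-5):
    """Normalize features using statistics pooled over a mixture of
    on-policy and off-policy transitions (batch-norm style).

    Using one set of shared mixture statistics keeps both data
    distributions on a common scale, rather than normalizing each
    batch with its own mean and variance.
    """
    mixed = np.concatenate([on_batch, off_batch], axis=0)
    mu = mixed.mean(axis=0)        # per-feature mean of the mixture
    sigma = mixed.std(axis=0)      # per-feature std of the mixture
    normalize = lambda x: (x - mu) / (sigma + eps)
    return normalize(on_batch), normalize(off_batch)
```

After normalization the pooled data has (approximately) zero mean and unit variance per feature, even when the two batches come from very different distributions.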
TextGAIL: Generative Adversarial Imitation Learning for Text Generation
Generative Adversarial Networks (GANs) for text generation have recently
received many criticisms, as they perform worse than their MLE counterparts. We
suspect previous text GANs' inferior performance is due to the lack of a
reliable guiding signal in their discriminators. To address this problem, we
propose a generative adversarial imitation learning framework for text
generation that uses large pre-trained language models to provide more reliable
reward guidance. Our approach uses a contrastive discriminator and proximal
policy optimization (PPO) to stabilize and improve text generation performance.
For evaluation, we conduct experiments on a diverse set of unconditional and
conditional text generation tasks. Experimental results show that TextGAIL
achieves better performance in terms of both quality and diversity than the MLE
baseline. We also validate our intuition that TextGAIL's discriminator
demonstrates the capability of providing reasonable rewards with an additional
task.
Comment: AAAI 202
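The stabilization the abstract attributes to PPO comes from its clipped surrogate objective, which bounds how far a single update can move the policy. A minimal, generic sketch of that objective (illustrative names, not TextGAIL's code):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1-eps, 1+eps] removes the incentive to change the policy beyond
    the trust region, which is what stabilizes training.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic bound: take the smaller surrogate, negate for a loss.
    return -np.mean(np.minimum(unclipped, clipped))
```

In the adversarial setup described above, the advantages would be derived from the discriminator's reward signal rather than an environment return.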