    Shapley Q-value: A Local Reward Approach to Solve Global Reward Games

    Cooperative games are a critical research area in multi-agent reinforcement learning (MARL). The global reward game is a subclass of cooperative games in which all agents aim to maximize the global reward. Credit assignment is an important problem in the global reward game. Most previous works adopted a non-cooperative game-theoretic framework with the shared reward approach, i.e., each agent is assigned the shared global reward directly. This, however, may give each agent an inaccurate measure of its contribution to the group, which can cause inefficient learning. To deal with this problem, we i) introduce a cooperative game-theoretic framework called the extended convex game (ECG), which is a superset of the global reward game, and ii) propose a local reward approach called Shapley Q-value. In contrast to the shared reward approach, Shapley Q-value distributes the global reward so that it reflects each agent's own contribution. Moreover, we derive a MARL algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG), which uses Shapley Q-value as the critic for each agent. We evaluate SQDDPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction against state-of-the-art algorithms, e.g., MADDPG, COMA, Independent DDPG and Independent A2C. In the experiments, SQDDPG shows a significant improvement in convergence rate. Finally, we plot Shapley Q-value and validate its property of fair credit assignment.
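The idea of distributing a global reward according to each agent's contribution can be illustrated with the classical Shapley value. The sketch below is not the SQDDPG algorithm from the abstract; it computes exact Shapley values for a small toy game by averaging marginal contributions over all orderings. The `STRENGTH` table and `toy_value` function are hypothetical and chosen only so the game is convex.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the grand coalition can be assembled."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical per-agent "strengths" (illustrative, not from the paper).
STRENGTH = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_value(coalition):
    """Convex toy game: a coalition earns the square of its summed strength."""
    return sum(STRENGTH[p] for p in coalition) ** 2


phi = shapley_values(list(STRENGTH), toy_value)
print(phi)
```

In this toy game the grand coalition earns 36, and the credits come out as 6, 12 and 18 for agents a, b and c, so the shares sum to the global reward while differing by contribution. Enumerating all orderings costs factorial time, so it is practical only for a handful of agents.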

    The excellence in research for Australia Scheme: An evaluation of the draft journal weights for economics

    In February 2008, the Australian government announced its intention to develop a new quality and evaluation system for research conducted at the nation’s universities. Although the Excellence in Research for Australia (ERA) scheme will utilize several measures to evaluate institutional performance, we have chosen to focus on one element only: the assessment of refereed journal article output based on ERA’s own journal weighting scheme. The ERA weighting scheme will undoubtedly shape the reward structure facing university administrators and individual academics. Our objective is to explore the nature of the ERA weighting scheme for economics, and to demonstrate how it impacts on departmental and individual researcher rankings relative to rankings generated by alternative schemes employed in the economics literature. In order to do so, we utilize data from New Zealand’s economics departments and the draft set of journal weights (DERA) released in August 2008 by ERA officials. Given the similarities between Australia and New Zealand, our findings should have relevance to the Australian scene. As a result, we hope to provide the reader with a better understanding of the type of research activity that influences DERA rankings at both the departmental and individual level

    Rewarding Sequential Innovators: Patents Prizes and Buyouts

    This paper presents a model of cumulative innovation where firms are heterogeneous in their research ability. We study the optimal reward policy when the quality of the ideas and their subsequent development effort are private information. The optimal assignment of property rights must counterbalance the incentives of current and future innovators. The resulting mechanism resembles a menu of patents that, contrary to the existing literature, have infinite duration and fixed scope, where the latter increases in the value of the idea. Finally, we provide a way to implement this patent menu by using a simple buyout scheme: The innovator commits at the outset to a price ceiling at which he will sell his rights to a future inventor. By paying a larger fee, a higher price ceiling is obtained. Any subsequent innovator must pay this price and purchase its own buyout fee contract.

    Rewarding sequential innovators: prizes, patents and buyouts

    This paper presents a model of cumulative innovation where firms are heterogeneous in their research ability. We study the optimal reward policy when the quality of the ideas and their subsequent development effort are private information. The optimal assignment of property rights must counterbalance the incentives of current and future innovators. The resulting mechanism resembles a menu of patents that have infinite duration and fixed scope, where the latter increases in the value of the idea. Finally, we provide a way to implement this patent menu by using a simple buyout scheme: The innovator commits at the outset to a price ceiling at which he will sell his rights to a future inventor. By paying a larger fee initially, a higher price ceiling is obtained. Any subsequent innovator must pay this price and purchase its own buyout fee contract.

    A Unified Theory of Dual-Process Control

    Dual-process theories play a central role in both psychology and neuroscience, figuring prominently in fields ranging from executive control to reward-based learning to judgment and decision making. In each of these domains, two mechanisms appear to operate concurrently, one relatively high in computational complexity, the other relatively simple. Why is neural information processing organized in this way? We propose an answer to this question based on the notion of compression. The key insight is that dual-process structure can enhance adaptive behavior by allowing an agent to minimize the description length of its own behavior. We apply a single model based on this observation to findings from research on executive control, reward-based learning, and judgment and decision making, showing that seemingly diverse dual-process phenomena can be understood as domain-specific consequences of a single underlying set of computational principles

    CHARACTERISTICS ULU AL-ALBA

    The Qur'an states that the ulu al-albāb have their own special features. This research aims to determine the characteristics of the ulu al-albāb in the Qur'an in QS al-Ra'd/13: 19-24. This research uses descriptive analysis with a content analysis approach. The technique used in data collection is library research (library review). The method used in this research is tahlili. The results show that the ulu al-albāb in QS al-Ra'd/13: 19-24 have special characteristics, namely steadfastness in fulfilling promises, maintaining human ties, and patience in worship. Verse 24 states that the ulu al-albāb will receive heaven's reward with special honor from the angels.

    SuperHF: Supervised Iterative Learning from Human Feedback

    While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RLHF is a more sophisticated method used in top-tier models like ChatGPT but also suffers from instability and susceptibility to reward hacking. We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our hypothesis is two-fold: that the reward model used in RLHF is critical for efficient data use and model generalization, and that the use of Proximal Policy Optimization (PPO) in RLHF may not be necessary and could contribute to instability issues. SuperHF replaces PPO with a simple supervised loss and a Kullback-Leibler (KL) divergence prior. It creates its own training data by repeatedly sampling a batch of model outputs and filtering them through the reward model in an online learning regime. We then break down the reward optimization problem into three components: robustly optimizing the training rewards themselves; preventing reward hacking (exploitation of the reward model that degrades model performance), as measured by a novel METEOR similarity metric; and maintaining good performance on downstream evaluations.
Our experimental results show SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our GPT-4 based qualitative evaluation scheme, all while being significantly simpler to implement, highlighting SuperHF's potential as a competitive language model alignment technique. Comment: Accepted to the Socially Responsible Language Modelling Research (SoLaR) workshop at NeurIPS 202
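The sample, filter, and supervised-update loop described in this abstract can be sketched on a toy problem. The code below is not the authors' implementation: it assumes a four-outcome categorical "policy" in place of a language model and a hypothetical lookup table in place of a learned reward model. The function name `filtered_supervised_step` and all hyperparameters are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_supervised_step(logits, ref_probs, reward, rng,
                             batch=16, keep=4, lr=0.5, kl_coef=0.1):
    """One sample -> filter -> supervised-update iteration on a toy policy."""
    probs = softmax(logits)
    n = len(logits)
    # 1. Sample a batch of outputs from the current policy.
    samples = rng.choices(range(n), weights=probs, k=batch)
    # 2. Filter: keep the samples the reward function scores highest.
    kept = sorted(samples, key=reward, reverse=True)[:keep]
    target = [0.0] * n
    for s in kept:
        target[s] += 1.0 / keep
    # 3. Gradient of the cross-entropy on the kept samples w.r.t. the logits.
    ce_grad = [probs[i] - target[i] for i in range(n)]
    # 4. Gradient of KL(policy || reference), which penalises drift.
    log_ratio = [math.log(probs[i] / ref_probs[i]) for i in range(n)]
    kl = sum(p * r for p, r in zip(probs, log_ratio))
    kl_grad = [probs[i] * (log_ratio[i] - kl) for i in range(n)]
    # Plain gradient-descent step on supervised loss + kl_coef * KL.
    return [z - lr * (g + kl_coef * k)
            for z, g, k in zip(logits, ce_grad, kl_grad)]


rewards = [0.0, 1.0, 2.0, 3.0]   # hypothetical reward scores per output
logits = [0.0, 0.0, 0.0, 0.0]    # the toy policy starts uniform
ref_probs = softmax(logits)      # frozen copy of the starting policy
rng = random.Random(0)
for _ in range(50):
    logits = filtered_supervised_step(
        logits, ref_probs, lambda i: rewards[i], rng)
final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

In this toy setting the policy shifts most of its probability onto the highest-reward output, while the KL term keeps it from collapsing entirely away from the reference distribution. How closely this mirrors the behaviour of the real method on language models is not something the sketch can show.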

    Semantics and pragmatics of anti-proverbs

    A proverb is a short and simple saying which is widely known and which condenses the common sense, experience, expectations and wisdom of mankind. An innovative alteration of a standard proverb, often used for humorous effect, is called an anti-proverb. For instance, by changing a single word, the well-known proverb "Virtue is its own reward" is transformed into the anti-proverb "Virtue is its own punishment". Such intentional transformations of traditional proverbs are gaining popularity in today’s world. Writers often use them for humorous or satirical purposes in advertising, mass media and social networks. The aim of this research is to explore and identify the linguistic mechanisms which lead to the creation of anti-proverbs. We will compare structural features of a number of anti-proverbs found in various sources with their traditional proverb counterparts. By means of such analysis we hope to identify the underlying grammatical, semantic and contextual arrangements of anti-proverbs. This research will also include an analysis of the reasons for linguistic innovation in anti-proverbs, based on their use for advertising, journalistic or humorous purposes.

    Carrot and Yardstick Regulation: Enhancing Market Performance with Output Prizes

    The fundamental objective of most regulatory mechanisms is to expand output at a sufficiently low cost to consumers. Many useable mechanisms, such as Loeb and Magat's, require detailed demand information and substantial profit recapture by the regulator in order to achieve this objective. We present an apparently unexplored alternative approach--inducing competition among firms for shares of a monetary reward. Payments to a firm for output expansion thus depend on both its own behavior and the actions of other firms, which can even be firms in unrelated industries. We show that in a wide variety of circumstances, the resultant increase in consumer surplus exceeds the reward. Hence, even with no profit recapture, our approach can lead to Pareto improvements.
    Center for Research on Economic and Social Theory, Department of Economics, University of Michigan
    http://deepblue.lib.umich.edu/bitstream/2027.42/100698/1/ECON016.pd