72,279 research outputs found
Reinforcement learning with restrictions on the action set
Consider a 2-player normal-form game repeated over time. We introduce an
adaptive learning procedure, where the players only observe their own realized
payoff at each stage. We assume that agents do not know their own payoff
function, and have no information on the other player. Furthermore, we assume
that they have restrictions on their own action set such that, at each stage,
their choice is limited to a subset of their action set. We prove that the
empirical distributions of play converge to the set of Nash equilibria for
zero-sum and potential games, and games where one player has two actions.
Comment: 28 pages
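The procedure above can be illustrated with a minimal, bandit-style sketch: the player observes only its own realized payoff and, at each stage, may only choose from a restricted subset of actions. This is a simplified stand-in, not the paper's actual update rule; `payoff` and `available_fn` are hypothetical interfaces introduced here for illustration.

```python
import random

def restricted_payoff_learning(payoff, n_actions, available_fn,
                               n_stages=1000, eps=0.1):
    """Payoff-based play under action-set restrictions.

    payoff(a)       -> realized payoff of action a (the opponent's move is
                       hidden inside it; the player never sees it).
    available_fn(t) -> subset of actions the player may use at stage t.
    """
    totals = [0.0] * n_actions   # cumulative realized payoff per action
    counts = [0] * n_actions     # number of times each action was played
    history = []
    for t in range(n_stages):
        avail = available_fn(t)
        if random.random() < eps:
            a = random.choice(avail)          # occasional exploration
        else:
            # greedy among the actions currently available;
            # unexplored actions get priority via +inf
            a = max(avail, key=lambda i: totals[i] / counts[i]
                    if counts[i] else float("inf"))
        r = payoff(a)
        totals[a] += r
        counts[a] += 1
        history.append(a)
    return history
```

Under a fixed opponent this reduces to a restricted multi-armed bandit; the paper's interest is precisely the harder case where the opponent also adapts.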
Reward-Relevance-Filtered Linear Offline Reinforcement Learning
This paper studies offline reinforcement learning with linear function
approximation in a setting with decision-theoretic, but not estimation,
sparsity. The structural restrictions of the data-generating process presume
that the transitions factor into a sparse component that affects the reward and
could affect additional exogenous dynamics that do not affect the reward.
Although the minimally sufficient adjustment set for estimation of full-state
transition properties depends on the whole state, the optimal policy and
therefore state-action value function depends only on the sparse component: we
call this causal/decision-theoretic sparsity. We develop a method for
reward-filtering the estimation of the state-action value function to the
sparse component by a modification of thresholded lasso in least-squares policy
evaluation. We provide theoretical guarantees for our reward-filtered linear
fitted-Q-iteration, with sample complexity depending only on the size of the
sparse component.
Comment: conference version accepted at AISTATS 202
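The reward-filtering idea can be sketched as follows: fit a sparse regression of rewards on state features, threshold the coefficients to recover the reward-relevant support, and restrict downstream least-squares policy evaluation to those coordinates. This is a minimal illustration of thresholded lasso for support recovery, not the paper's full fitted-Q-iteration procedure; the function names and the tuning values `lam` and `tau` are assumptions made here for the sketch.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso:
    minimizes 0.5 * ||y - X b||^2 / n + lam * ||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # partial residual with coordinate j removed
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            # soft-thresholding update
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def reward_relevant_features(X, rewards, lam=0.1, tau=0.05):
    """Keep coordinates whose thresholded-lasso coefficient exceeds tau.
    Downstream policy evaluation would then use only X[:, S]."""
    b = lasso_cd(X, rewards, lam)
    return np.flatnonzero(np.abs(b) > tau)
```

The appeal of the decision-theoretic sparsity assumption is visible here: the state-action value function, and hence the sample complexity of its estimation, depends only on the selected support rather than on the full state dimension.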
Intelligent Scheduling Method for Bulk Cargo Terminal Loading Process Based on Deep Reinforcement Learning
Funding: This research was funded by the National Natural Science Foundation of China under Grant U1964201 and Grant U21B6001, the Major Scientific and Technological Special Project of Heilongjiang Province under Grant 2021ZX05A01, the Heilongjiang Natural Science Foundation under Grant LH2019F020, and the Major Scientific and Technological Research Project of Ningbo under Grant 2021Z040. Publisher Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland.

Sea freight is one of the most important modes of transportation and distribution for coal and other bulk cargo. This paper proposes a method for optimizing the scheduling efficiency of the bulk cargo loading process based on deep reinforcement learning. The process involves a large number of states and possible choices that must be taken into account, a task currently performed by skilled scheduling engineers on site. In terms of modeling, we extracted important information from actual working data of the terminal to form the state space of the model; the yard information and the demand information of the ship are also considered. The scheduling output of each conveying path from the yard to the cabin is the action of the agent. To avoid conflicts in which one machine is occupied by two tasks at the same time, certain restrictions are placed on whether an action can be executed. Based on Double DQN, an improved deep reinforcement learning method is proposed with a fully connected network structure, in which action sets are selected according to the network's value estimates and the occupancy status of the environment. To make the network converge more quickly, a new epsilon-greedy exploration strategy is also proposed, which applies different exploration rates to completely random selection and to random selection among feasible actions. After training, an improved scheduling result is obtained when tasks arrive randomly and the yard state is random.
An important contribution of this paper is to integrate the useful working-time features of the bulk cargo terminal into a state set, divide the scheduling process into discrete actions, and thereby reduce the scheduling problem to simple inputs and outputs. Another major contribution is the design of a reinforcement learning algorithm for the bulk cargo terminal scheduling problem with improved training efficiency, which provides a practical example of solving bulk cargo terminal scheduling problems using reinforcement learning.
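The action-selection scheme described above, masking infeasible actions by machine occupancy and using two separate exploration rates, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation; the rate values are placeholders.

```python
import random

def masked_epsilon_greedy(q_values, feasible, eps_any=0.05, eps_feasible=0.2):
    """Two-rate exploration over a restricted action set.

    With probability eps_any pick any action (completely random selection);
    with probability eps_feasible pick a random *feasible* action;
    otherwise act greedily over the feasible actions only.
    `feasible` lists the action indices not blocked by machine occupancy.
    """
    u = random.random()
    if u < eps_any:
        return random.randrange(len(q_values))          # fully random
    if u < eps_any + eps_feasible:
        return random.choice(feasible)                  # feasible-random
    return max(feasible, key=lambda a: q_values[a])     # greedy, masked
```

Separating the two rates lets the agent keep a small amount of unconstrained exploration while concentrating most exploratory moves on actions that can actually be executed, which is what speeds up convergence in the authors' account.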
Answer Set Programming for Non-Stationary Markov Decision Processes
Non-stationary domains, where unforeseen changes happen, present a challenge
for agents to find an optimal policy for a sequential decision making problem.
This work investigates a solution to this problem that combines Markov Decision
Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming
(ASP) in a method we call ASP(RL). In this method, Answer Set Programming is
used to find the possible trajectories of an MDP, from where Reinforcement
Learning is applied to learn the optimal policy of the problem. Results show
that ASP(RL) is capable of efficiently finding the optimal solution of an MDP
representing non-stationary domains.
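The division of labor in ASP(RL) can be sketched as a two-stage loop: a filtering step prunes the MDP down to feasible transitions, and tabular reinforcement learning then runs only over what survives. In the sketch below the `transitions` dictionary stands in for the output of the ASP solving step (the real method derives it with Answer Set Programming); the hyperparameter values are illustrative.

```python
import random

def q_learning_on_feasible(transitions, rewards, n_states, n_actions,
                           episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning restricted to pre-filtered transitions.

    transitions[(s, a)] = s2  -- the feasible moves (stand-in for ASP output)
    rewards[(s, a)]     = r   -- rewards on those moves (default 0)
    States absent from `transitions` are treated as terminal.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    allowed = {}
    for (s, a) in transitions:
        allowed.setdefault(s, []).append(a)
    for _ in range(episodes):
        s = 0
        while s in allowed:                      # stop at terminal states
            acts = allowed[s]
            a = (random.choice(acts) if random.random() < eps
                 else max(acts, key=lambda x: Q[s][x]))
            s2 = transitions[(s, a)]
            nxt = max(Q[s2]) if s2 in allowed else 0.0
            Q[s][a] += alpha * (rewards.get((s, a), 0.0) + gamma * nxt - Q[s][a])
            s = s2
    return Q
```

Because the learner never considers transitions pruned by the filtering step, the effective state-action space, and with it the learning effort, shrinks, which is the efficiency claim made above.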
Developing social action capabilities in a humanoid robot using an interaction history architecture
DOI: 10.1109/ICHR.2008.4756013
We present experimental results for the humanoid robot Kaspar2 engaging in a simple “peekaboo” interaction game with a human partner. The robot develops the capability to engage in the game by using its history of interactions, coupled with audio and visual feedback from the interaction partner, to continually generate increasingly appropriate behaviour. The robot also uses facial expressions to feed back its level of reward to the partner. The results support the hypothesis that reinforcement of time-extended experiences through interaction allows a robot to act appropriately in an interaction.
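The core of an interaction-history architecture, selecting the next behaviour by matching the current sensory context against past experiences weighted by the reward they earned, can be sketched as a nearest-neighbour rule. This is a minimal illustration under assumed interfaces, not the Kaspar2 implementation; `history` entries and the behaviour names are hypothetical.

```python
import math

def choose_behaviour(history, context, default="wait"):
    """Pick the behaviour whose past contexts, weighted by reward, best
    match the current context.

    history -- list of (context_vector, behaviour, reward) triples
    context -- current sensory context as a numeric vector
    """
    scores = {}
    for past_ctx, behaviour, reward in history:
        dist = math.dist(past_ctx, context)
        # nearby, well-rewarded experiences contribute the most
        scores[behaviour] = scores.get(behaviour, 0.0) + reward / (1.0 + dist)
    return max(scores, key=scores.get) if scores else default
```

Reinforcing whole time-extended experiences, rather than single state-action pairs, is what lets the robot reproduce a multi-step game like peekaboo from its own interaction record.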