1,785 research outputs found
Simulated Experience Evaluation in Developing Multi-agent Coordination Graphs
Cognitive science has proposed that a way people learn is through self-critiquing by generating 'what-if' strategies for events (simulation). It is theorized that people use this method to learn something new as well as to learn more quickly. This research adds this concept to a graph-based genetic program. Memories are recorded during fitness assessment and retained in a global memory bank based on the magnitude of change in the agent's energy and age of the memory. Between generations, candidate agents perform in simulations of the stored memories. Candidates that perform similarly to good memories and differently from bad memories are more likely to be included in the next generation. The simulation-informed genetic program is evaluated in two domains: sequence matching and Robocode. Results indicate the algorithm does not perform equally in all environments. In sequence matching, experiential evaluation fails to perform better than the control. However, in Robocode, the experiential evaluation method initially outperforms the control then stagnates and often regresses. This is likely an indication that the algorithm is over-learning a single solution rather than adapting to the environment and that learning through simulation includes a satisficing component.
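The memory-bank mechanism described in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `Memory` fields, the exponential age decay in `retention_score`, the `prune` capacity rule, and the agreement-based `simulation_score` are all hypothetical stand-ins for "retained based on the magnitude of change in energy and age" and "perform similarly to good memories and differently from bad memories".

```python
from dataclasses import dataclass


@dataclass
class Memory:
    state: tuple         # situation the agent was in (hypothetical encoding)
    action: str          # what the agent did
    energy_delta: float  # change in the agent's energy caused by the action
    age: int = 0         # generations since the memory was recorded


def retention_score(memory, decay=0.9):
    # Assumed rule: larger energy changes matter more, older memories matter less.
    return abs(memory.energy_delta) * decay ** memory.age


def prune(bank, capacity):
    # Keep only the `capacity` memories with the highest retention score.
    return sorted(bank, key=retention_score, reverse=True)[:capacity]


def simulation_score(policy, bank):
    # Fraction of memories where the candidate repeats a good action
    # (energy gain) or deviates from a bad one (energy loss).
    if not bank:
        return 0.0
    hits = 0
    for m in bank:
        same_action = policy(m.state) == m.action
        was_good = m.energy_delta > 0
        if same_action == was_good:
            hits += 1
    return hits / len(bank)


def always_fire(state):
    # Hypothetical candidate agent that ignores the state.
    return "fire"


bank = [
    Memory((0,), "fire", 5.0),
    Memory((1,), "fire", -4.0),
    Memory((2,), "move", -3.0, age=1),
    Memory((3,), "move", 0.5, age=5),
]
kept = prune(bank, capacity=3)
score = simulation_score(always_fire, kept)
```

In a genetic program, a score like this could be blended into fitness or used to bias selection between generations; how the original work combines the two is not stated in the abstract.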
Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising
Real-time advertising allows advertisers to bid for each impression for a
visiting user. To optimize specific goals such as maximizing revenue and return
on investment (ROI) led by ad placements, advertisers not only need to estimate
the relevance between the ads and user's interests, but most importantly
require a strategic response with respect to other advertisers bidding in the
market. In this paper, we formulate bidding optimization with multi-agent
reinforcement learning. To deal with a large number of advertisers, we propose
a clustering method and assign each cluster a strategic bidding agent. A
practical Distributed Coordinated Multi-Agent Bidding (DCMAB) has been proposed
and implemented to balance the tradeoff between the competition and cooperation
among advertisers. The empirical study on our industry-scaled real-world data
has demonstrated the effectiveness of our methods. Our results show
cluster-based bidding would largely outperform single-agent and bandit
approaches, and the coordinated bidding achieves better overall objectives than
purely self-interested bidding agents.
Sample-Efficient Reinforcement Learning of Robot Control Policies in the Real World
The goal of reinforcement learning is to enable systems to autonomously solve tasks in the real world, even in the absence of prior data. To succeed in such situations, reinforcement learning algorithms collect new experience through interactions with the environment to further the learning process. The behaviour is optimized by maximizing a reward function, which assigns high numerical values to desired behaviours. Especially in robotics, such interactions with the environment are expensive in terms of the required execution time, human involvement, and mechanical degradation of the system itself. Therefore, this thesis aims to introduce sample-efficient reinforcement learning methods which are applicable to real-world settings and control tasks such as bimanual manipulation and locomotion. Sample efficiency is achieved through directed exploration, either by using dimensionality reduction or trajectory optimization methods. Finally, it is demonstrated how data-efficient reinforcement learning methods can be used to optimize the behaviour and morphology of robots at the same time. Dissertation/Thesis, Doctoral Dissertation, Computer Science 201
Indirect Methods for Robot Skill Learning
Robot learning algorithms are appealing alternatives for acquiring rational robotic behaviors from data collected during the execution of tasks. Furthermore, most robot learning techniques are stated as isolated stages and focus on directly obtaining rational policies by optimizing only the performance measures of single tasks. However, formulating robotic skill acquisition processes in this way has some disadvantages. For example, if the same skill has to be learned by different robots, independent learning processes must be carried out to acquire a separate policy for each robot. Similarly, if a robot has to learn diverse skills, it must acquire the policy for each task in separate learning processes, sequentially and commonly starting from scratch. In the same way, formulating the learning process in terms of only the performance measure makes robots unintentionally avoid situations that should not be repeated, but without any mechanism that captures the necessity of not repeating those wrong behaviors. In contrast, humans and other animals exploit their experience not only to improve their performance on the task they are currently executing, but also to indirectly construct multiple models that help them with that particular task and generalize to new problems. Accordingly, the models and algorithms proposed in this thesis seek to be more data efficient and to extract more information from the interaction data collected either from expert demonstrations or from the robot's own experience. The first approach encodes robotic skills with shared latent variable models, obtaining latent representations that can be transferred from one robot to others, thereby avoiding learning the same task from scratch.
The second approach learns complex rational policies by representing them as hierarchical models that can perform multiple concurrent tasks, and whose components are learned in the same learning process instead of separate processes. Finally, the third approach uses the interaction data to learn two alternative and antagonistic policies that capture what to do and what not to do, and which influence the learning process in addition to the performance measure defined for the task.
Shapley Q-value: A Local Reward Approach to Solve Global Reward Games
Cooperative games are a critical research area in multi-agent reinforcement
learning (MARL). The global reward game is a subclass of cooperative games in which
all agents aim to maximize the global reward. Credit assignment is an important
problem studied in the global reward game. Most previous works adopted the
view of a non-cooperative game-theoretic framework with the shared reward
approach, i.e., each agent is assigned the shared global reward directly.
This, however, may give each agent an inaccurate reward on its contribution to
the group, which could cause inefficient learning. To deal with this problem,
we i) introduce a cooperative-game theoretical framework called extended convex
game (ECG) that is a superset of global reward game, and ii) propose a local
reward approach called Shapley Q-value. Shapley Q-value is able to distribute
the global reward, reflecting each agent's own contribution in contrast to the
shared reward approach. Moreover, we derive an MARL algorithm called Shapley
Q-value deep deterministic policy gradient (SQDDPG), using Shapley Q-value as
the critic for each agent. We evaluate SQDDPG on Cooperative Navigation,
Prey-and-Predator and Traffic Junction, compared with the state-of-the-art
algorithms, e.g., MADDPG, COMA, Independent DDPG and Independent A2C. In the
experiments, SQDDPG shows a significant improvement in convergence rate.
Finally, we plot Shapley Q-value and validate the property of fair credit
assignment.
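To make the credit-assignment idea concrete, the following Python sketch computes the classic Shapley value exactly for a small toy game. It is not SQDDPG or the Shapley Q-value critic described in the abstract; it only illustrates the underlying principle of splitting a global reward according to each agent's average marginal contribution. The agent names and the `toy_global_reward` coalition function are hypothetical.

```python
from itertools import permutations


def shapley_values(agents, value):
    """Exact Shapley values: average each agent's marginal contribution
    over every ordering in which the coalition can be assembled."""
    totals = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            totals[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {agent: total / len(orderings) for agent, total in totals.items()}


def toy_global_reward(coalition):
    # Hypothetical coalition value: individual contributions plus a
    # synergy bonus when agents "a" and "b" act together.
    base = {"a": 1.0, "b": 2.0, "c": 3.0}
    reward = sum(base[agent] for agent in coalition)
    if "a" in coalition and "b" in coalition:
        reward += 3.0
    return reward


agents = ["a", "b", "c"]
credits = shapley_values(agents, toy_global_reward)
# The synergy bonus is split evenly between "a" and "b";
# the credits sum to the reward of the full coalition.
```

Exact enumeration visits every ordering of the agents, so its cost grows factorially with the number of agents; practical methods therefore rely on approximations such as sampling orderings.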