Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning

Author: Forestier Sébastien
Mollard Yoan
Oudeyer Pierre-Yves
Portelas Rémy
Publication date: 24/07/2020

Intrinsically motivated spontaneous exploration is a key enabler of autonomous lifelong learning in human children. It enables the discovery and acquisition of large repertoires of skills through self-generation, self-selection, self-ordering and self-experimentation of learning goals. We present an algorithmic approach called Intrinsically Motivated Goal Exploration Processes (IMGEP) to enable similar properties of autonomous or self-supervised learning in machines. The IMGEP algorithmic architecture relies on several principles: 1) self-generation of goals, generalized as fitness functions; 2) selection of goals based on intrinsic rewards; 3) exploration with incremental goal-parameterized policy search and exploitation of the gathered data with a batch learning algorithm; 4) systematic reuse of information acquired when targeting a goal for improving towards other goals. We present a particularly efficient form of IMGEP, called Modular Population-Based IMGEP, that uses a population-based policy and an object-centered modularity in goals and mutations. We provide several implementations of this architecture and demonstrate their ability to automatically generate a learning curriculum within several experimental setups, including a real humanoid robot that can explore multiple spaces of goals with several hundred continuous dimensions. While no particular target goal is provided to the system, this curriculum allows the discovery of skills that act as stepping stones for learning more complex skills, e.g. nested tool use. We show that learning diverse spaces of goals with intrinsic motivations is more efficient for learning complex skills than only trying to directly learn these complex skills.
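The population-based exploration loop described in this abstract can be illustrated with a much-simplified sketch. It is not the authors' implementation: goals are sampled uniformly instead of being selected by intrinsic rewards, there is a single goal space with no object-centered modularity, and the names `goal_exploration`, `env`, `goal_sampler`, `n_bootstrap` and `sigma` are hypothetical. It only shows how every stored (parameters, outcome) pair can be reused when a new goal is targeted.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two outcome vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def goal_exploration(env, param_dim, goal_sampler, n_iters=200,
                     n_bootstrap=20, sigma=0.1, seed=0):
    """Toy goal exploration with a population of stored policies.

    env(params) -> outcome vector; goal_sampler(rng) -> goal vector.
    Assumption: goals are sampled uniformly (no intrinsic-reward selection).
    """
    rng = random.Random(seed)
    memory = []  # population of (policy parameters, observed outcome)
    for it in range(n_iters):
        if it < n_bootstrap:
            # Bootstrap the population with random policy parameters.
            params = [rng.uniform(-1.0, 1.0) for _ in range(param_dim)]
        else:
            # Self-generate a goal, then reuse the stored policy whose
            # outcome came closest to it, whatever goal it was tried for.
            goal = goal_sampler(rng)
            best_params, _ = min(memory, key=lambda po: sq_dist(po[1], goal))
            # Mutate the retrieved parameters, clipped to [-1, 1].
            params = [max(-1.0, min(1.0, p + rng.gauss(0.0, sigma)))
                      for p in best_params]
        outcome = env(params)
        memory.append((params, outcome))
    return memory
```

For example, `goal_exploration(lambda p: [p[0] + p[1], p[0] * p[1]], 2, lambda rng: [rng.uniform(-2, 2), rng.uniform(-1, 1)])` explores a toy two-dimensional outcome space; the environment here is made up for illustration.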

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Learning Adaptive Display Exposure for Real-Time Advertising

Author: Agrawal Shipra
Andrychowicz Marcin
Aranyak Mehta
Bacon Pierre-Luc
Badanidiyuru Ashwinkumar
Hester Todd
Hu Yujing
Kulkarni Tejas D
Tang Liang
Wu Di
Wu Huasen
Zhang Weinan
Zhao Jun
Publication date: 02/09/2019

In e-commerce advertising, where product recommendations and product ads are presented to users simultaneously, the traditional setting is to display ads at fixed positions. However, under such a setting, the advertising system loses the flexibility to control the number and positions of ads, resulting in sub-optimal platform revenue and user experience. Consequently, major e-commerce platforms (e.g., Taobao.com) have begun to consider more flexible ways to display ads. In this paper, we investigate the problem of advertising with adaptive exposure: can we dynamically determine the number and positions of ads for each user visit under certain business constraints so that the platform revenue can be increased? More specifically, we consider two types of constraints: a request-level constraint ensures user experience for each user visit, and a platform-level constraint controls the overall platform monetization rate. We model this problem as a Constrained Markov Decision Process with per-state constraint (psCMDP) and propose a constrained two-level reinforcement learning approach to decompose the original problem into two relatively independent sub-problems. To accelerate policy learning, we also devise a constrained hindsight experience replay mechanism. Experimental evaluations on industry-scale real-world datasets demonstrate both that our approach obtains higher revenue under the constraints and that the constrained hindsight experience replay mechanism is effective. Comment: accepted by CIKM201

arXiv.org e-Print Archive

Crossref

UCL Discovery

Perturbed-History Exploration in Stochastic Linear Bandits

Author: Boutilier Craig
Ghavamzadeh Mohammad
Kveton Branislav
Szepesvari Csaba
Publication date: 21/03/2019

We propose a new online algorithm for minimizing the cumulative regret in stochastic linear bandits. The key idea is to build a perturbed history, which mixes the history of observed rewards with a pseudo-history of randomly generated i.i.d. pseudo-rewards. Our algorithm, perturbed-history exploration in a linear bandit (LinPHE), estimates a linear model from its perturbed history and pulls the arm with the highest value under that model. We prove a $\tilde{O}(d \sqrt{n})$ gap-free bound on the expected $n$-round regret of LinPHE, where $d$ is the number of features. Our analysis relies on novel concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we extend LinPHE to a logistic reward model. We evaluate both algorithms empirically and show that they are practical.
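The perturbed-history idea in this abstract can be sketched in a few lines. The abstract does not give the algorithm's details, so the following are assumptions made for illustration: rewards lie in [0, 1], each observation receives `a` fresh Bernoulli(1/2) pseudo-rewards per round, the model is a ridge-regularized least-squares fit with parameter `lam`, and every arm is pulled once as a warm start. The names `lin_phe`, `solve`, `pull`, `a` and `lam` are hypothetical.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def lin_phe(arms, pull, n_rounds, a=1, lam=1.0, seed=0):
    """Perturbed-history exploration sketch; returns pull counts per arm.

    arms: list of feature vectors; pull(i) -> reward in [0, 1] (assumed).
    """
    rng = random.Random(seed)
    d = len(arms[0])
    # Gram matrix: ridge term plus (a + 1) copies of each pulled arm,
    # one for the real reward and a for its pseudo-rewards.
    G = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    history = []  # (feature vector, observed reward)
    counts = [0] * len(arms)
    for t in range(n_rounds):
        if t < len(arms):
            choice = t  # warm start: pull every arm once (assumption)
        else:
            # Perturb the history: add fresh pseudo-rewards to each reward.
            b = [0.0] * d
            for x, y in history:
                z = sum(rng.random() < 0.5 for _ in range(a))
                for i in range(d):
                    b[i] += x[i] * (y + z)
            theta = solve(G, b)  # linear model fit to the perturbed history
            choice = max(range(len(arms)),
                         key=lambda k: sum(xi * ti
                                           for xi, ti in zip(arms[k], theta)))
        x, y = arms[choice], pull(choice)
        history.append((x, y))
        counts[choice] += 1
        for i in range(d):
            for j in range(d):
                G[i][j] += (a + 1) * x[i] * x[j]
    return counts
```

The randomness of the pseudo-rewards is what drives exploration: an arm with few observations has a noisier estimate, so it is occasionally ranked first even when its observed rewards are low. Resampling the whole history each round costs time linear in the number of past pulls, which is acceptable for a sketch.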

arXiv.org e-Print Archive