A taxonomy for similarity metrics between Markov decision processes
Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer learning. Transfer is based on the idea of reusing the knowledge acquired in the learning of a set of source tasks in a new learning process on a target task, assuming that the target and source tasks are close enough. In recent years, transfer learning has succeeded in making reinforcement learning (RL) algorithms more efficient (e.g., by reducing the number of samples needed to achieve (near-)optimal performance). Transfer in RL is based on the core concept of similarity: whenever the tasks are similar, the transferred knowledge can be reused to solve the target task and significantly improve the learning performance. Therefore, the selection of good metrics to measure these similarities is a critical aspect when building transfer RL algorithms, especially when this knowledge is transferred from simulation to the real world. In the literature, there are many metrics to measure the similarity between MDPs; hence, many definitions of similarity, or of its complement, distance, have been considered. In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far in light of that categorization. We also follow this taxonomy to survey the existing literature and suggest future directions for the construction of new metrics.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has also been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
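As an illustration of the simplest "model-based" family such a taxonomy covers, a distance between two finite MDPs can compare their transition and reward functions directly. The sketch below is purely illustrative: the function name and the particular max-over-state-action norm are assumptions, not definitions from the paper.

```python
import numpy as np

def mdp_distance(P1, R1, P2, R2):
    """Toy model-based distance between two finite MDPs.

    P*: (S, A, S) transition tensors; R*: (S, A) reward matrices.
    Distance = worst-case L1 gap between transition distributions
    plus worst-case reward gap, over all state-action pairs."""
    trans_gap = np.abs(P1 - P2).sum(axis=2).max()   # max_{s,a} ||P1(.|s,a) - P2(.|s,a)||_1
    reward_gap = np.abs(R1 - R2).max()              # max_{s,a} |R1(s,a) - R2(s,a)|
    return trans_gap + reward_gap

# Two 2-state, 1-action MDPs that differ slightly in dynamics and reward.
P1 = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])
P2 = np.array([[[0.8, 0.2]], [[0.2, 0.8]]])
R1 = np.array([[1.0], [0.0]])
R2 = np.array([[1.0], [0.1]])
d = mdp_distance(P1, R1, P2, R2)  # transition gap 0.2 + reward gap 0.1
```

Metrics of this kind are cheap but ignore behavioral equivalence; bisimulation-style metrics, another family such surveys discuss, compare what the MDPs let an agent achieve rather than their raw parameters.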
A tutorial on recursive models for analyzing and predicting path choice behavior
The problem at the heart of this tutorial consists in modeling the path
choice behavior of network users. This problem has been extensively studied in
transportation science, where it is known as the route choice problem. In this
literature, individuals' choices of paths are typically predicted using discrete
choice models. This article is a tutorial on a specific category of discrete
choice models called recursive, and it makes three main contributions: First,
for the purpose of assisting future research on route choice, we provide a
comprehensive background on the problem, linking it to different fields
including inverse optimization and inverse reinforcement learning. Second, we
formally introduce the problem and the recursive modeling idea along with an
overview of existing models, their properties and applications. Third, we
extensively analyze illustrative examples from different angles so that a
novice reader can gain intuition on the problem and the advantages provided by
recursive models in comparison to path-based ones.
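The recursive modeling idea can be made concrete with a toy sketch: in a recursive logit model, the value of a node satisfies a logsumexp Bellman recursion over its outgoing arcs, and arc choice probabilities follow a logit form, so path probabilities are built link by link instead of enumerating paths. The code below is a minimal illustration under assumed names and a deterministic arc-utility setup, not any specific model from the tutorial.

```python
import math

def recursive_values(arcs, utils, order, dest, mu=1.0):
    """arcs: node -> list of successor nodes; utils[(k, a)]: deterministic
    arc utility (e.g., negative travel cost); order: nodes listed so that
    every successor appears before its predecessors (reverse topological),
    starting from dest. Returns the expected-maximum value V(k) per node."""
    V = {dest: 0.0}
    for k in order:
        if k == dest:
            continue
        # Logsumexp recursion: V(k) = mu * log sum_a exp((u(a|k) + V(a)) / mu)
        V[k] = mu * math.log(sum(math.exp((utils[(k, a)] + V[a]) / mu)
                                 for a in arcs[k]))
    return V

# Tiny network: o -> {a, b} -> d, with arc utilities (negative costs).
arcs = {"o": ["a", "b"], "a": ["d"], "b": ["d"]}
utils = {("o", "a"): -1.0, ("o", "b"): -2.0,
         ("a", "d"): -1.0, ("b", "d"): -1.0}
V = recursive_values(arcs, utils, ["d", "a", "b", "o"], "d")

# Logit probability of leaving o via a: exp(u(o,a) + V(a) - V(o)).
p_via_a = math.exp(utils[("o", "a")] + V["a"] - V["o"])
```

Because the recursion works on arcs, the model never needs a choice set of enumerated paths, which is the practical advantage over path-based formulations (here `p_via_a` reproduces the logit split between the two routes).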
Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning
Provably efficient Model-Based Reinforcement Learning (MBRL) based on
optimism or posterior sampling (PSRL) is guaranteed to attain global
optimality asymptotically by introducing a complexity measure of the model.
However, the complexity might grow exponentially for even the simplest nonlinear
models, in which case global convergence is impossible within finitely many
iterations. When the model suffers from a large generalization error, which is
quantitatively measured by the model complexity, the uncertainty can be large.
The sampled model upon which the current policy is greedily optimized will thus
be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we
propose Conservative Dual Policy Optimization (CDPO) that involves a
Referential Update and a Conservative Update. The policy is first optimized
under a reference model, which imitates the mechanism of PSRL while offering
more stability. A conservative range of randomness is guaranteed by maximizing
the expectation of model value. Without harmful sampling procedures, CDPO can
still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic
policy improvement and global optimality simultaneously. Empirical results also
validate the exploration efficiency of CDPO.
Comment: Published at NeurIPS 2022.
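The contrast the abstract draws, between greedily optimizing against a single sampled model (PSRL-style) and maximizing the expectation of value over models, can be illustrated with a toy Bayesian bandit. Everything below (the posterior, the arm setup) is an assumed toy, not CDPO's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative posterior over a 3-armed bandit's mean rewards: arm 1 has the
# highest posterior mean but is far more uncertain than the others.
post_mean = np.array([0.50, 0.60, 0.55])
post_std = np.array([0.05, 0.40, 0.05])

samples = rng.normal(post_mean, post_std, size=(1000, 3))  # 1000 sampled "models"

# PSRL-style: greedily optimize against one sampled model per round;
# the chosen arm flips between rounds (aggressive policy updates).
psrl_choices = samples.argmax(axis=1)

# Conservative-style: choose the arm maximizing the *expectation* of value
# over sampled models; the choice is stable across rounds.
conservative_choice = samples.mean(axis=0).argmax()
```

Under high model uncertainty, `psrl_choices` keeps switching arms while `conservative_choice` settles on one, which mirrors the stability argument motivating the conservative update.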
Structural Return Maximization for Reinforcement Learning
Batch Reinforcement Learning (RL) algorithms attempt to choose a policy from
a designer-provided class of policies given a fixed set of training data.
Choosing the policy that maximizes an estimate of return often leads to
over-fitting when only limited data is available, because the policy class is
large relative to the amount of data. In this work, we focus on
learning policy classes that are appropriately sized to the amount of data
available. We accomplish this by using the principle of Structural Risk
Minimization, from Statistical Learning Theory, which uses Rademacher
complexity to identify a policy class that maximizes a bound on the return of
the best policy in the chosen policy class, given the available data. Unlike
similar batch RL approaches, our bound on return requires only extremely weak
assumptions on the true system.
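The Structural Risk Minimization principle can be sketched with a toy selector: among nested policy classes, pick the one maximizing empirical return minus a complexity penalty that grows with class size and shrinks with the amount of data. The finite-class penalty below is an illustrative stand-in for the paper's Rademacher-complexity bound.

```python
import math

def srm_select(returns_by_class, class_sizes, n, delta=0.05):
    """Pick the policy class maximizing a lower bound on true return:
    empirical best return minus a penalty growing with class size and
    shrinking with the number of trajectories n (illustrative finite-class
    bound, not the paper's actual Rademacher bound)."""
    best = None
    for ret, size in zip(returns_by_class, class_sizes):
        penalty = math.sqrt(math.log(size / delta) / (2 * n))
        bound = ret - penalty
        if best is None or bound > best[0]:
            best = (bound, size)
    return best[1]

# Nested classes: bigger classes fit the batch better (higher empirical
# return) but pay a larger penalty with only n = 50 trajectories.
chosen = srm_select(returns_by_class=[0.60, 0.70, 0.72],
                    class_sizes=[10, 1000, 10**6], n=50)
```

With these toy numbers the mid-size class wins: the largest class's tiny empirical gain is outweighed by its penalty, which is exactly the over-fitting trade-off the abstract describes.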
Advancing Data-Efficiency in Reinforcement Learning
In many real-world applications, including traffic control, robotics and web system
configurations, we are confronted with real-time decision-making problems where
data is limited. Reinforcement Learning (RL) allows us to construct a mathematical
framework to solve sequential decision-making problems under uncertainty. Under
low-data constraints, RL agents must be able to quickly identify relevant information in the observations, and to quickly learn how to act in order to attain their long-term objective. While recent advancements in RL have demonstrated impressive
achievements, the end-to-end approach they take favours autonomy and flexibility
at the expense of fast learning. To be of practical use, there is an undeniable need
to improve the data-efficiency of existing systems.
Ideal RL agents would possess an optimal way of representing their environment, combined with an efficient mechanism for propagating reward signals across
the state space. This thesis investigates the problem of data-efficiency in RL from
these two aforementioned perspectives. An in-depth overview of the different representation learning methods in use in RL is provided. The aim of this overview is to
categorise the different representation learning approaches and highlight the impact
of the representation on data-efficiency. Then, this framing is used to develop two
main research directions. The first problem focuses on learning a representation that
captures the geometry of the problem. An RL mechanism that uses a scalable graph-based feature learning method to learn such rich representations is introduced, ultimately leading to more efficient value function approximation. Secondly, ET(λ),
an algorithm that improves credit assignment in stochastic environments by propagating reward information counterfactually, is presented. ET(λ) results in faster learning compared to traditional methods that rely solely on temporal credit assignment. Overall, this thesis shows how a structural representation encoding the geometry of the state space and counterfactual credit assignment are key ingredients
for data-efficient RL.
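As background for the credit-assignment direction, the classical *temporal* credit assignment via eligibility traces, the baseline that ET(λ) improves upon, can be sketched in tabular form. The chain environment and hyperparameters below are illustrative assumptions, not the thesis's experiments.

```python
import numpy as np

def td_lambda_chain(n_states=5, episodes=200, alpha=0.1, lam=0.9, gamma=1.0):
    """Tabular TD(lambda) with accumulating eligibility traces on a
    deterministic chain s0 -> s1 -> ... -> terminal, reward 1 on the
    final transition."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)                  # eligibility traces
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else 0.0
            v_next = 0.0 if s == n_states - 1 else V[s + 1]
            delta = r + gamma * v_next - V[s]   # TD error at the current step
            e[s] += 1.0                         # accumulate trace at state s
            V += alpha * delta * e              # propagate the error along traces
            e *= gamma * lam                    # decay every trace
    return V

V = td_lambda_chain()  # every state's value approaches the true return of 1.0
```

Traces let a single terminal reward update all earlier states in one sweep, but still only along the trajectory actually taken; counterfactual propagation, as in ET(λ), is meant to go beyond exactly this limitation.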
Scalable Inference of Customer Similarities from Interactions Data using Dirichlet Processes
Under the sociological theory of homophily, people who are similar to one
another are more likely to interact with one another. Marketers often have
access to data on interactions among customers from which, with homophily as a
guiding principle, inferences could be made about the underlying similarities.
However, larger networks face a quadratic explosion in the number of potential
interactions that need to be modeled. This scalability problem renders
probability models of social interactions computationally infeasible for all
but the smallest networks. In this paper we develop a probabilistic framework
for modeling customer interactions that is both grounded in the theory of
homophily, and is flexible enough to account for random variation in who
interacts with whom. In particular, we present a novel Bayesian nonparametric
approach, using Dirichlet processes, to moderate the scalability problems that
marketing researchers encounter when working with networked data. We find that
this framework is a powerful way to draw insights into latent similarities of
customers, and we discuss how marketers can apply these insights to
segmentation and targeting activities.
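The clustering device behind a Dirichlet-process model can be sketched via its Chinese-restaurant-process representation: each customer joins an existing segment with probability proportional to its size, or opens a new one with probability proportional to a concentration parameter. The code below is a generic illustration of that mechanism, not the paper's interaction model.

```python
import numpy as np

def crp_assign(n_customers, alpha, rng):
    """Sequential Chinese-restaurant-process assignment. Returns the segment
    label of each customer and the per-segment counts. The number of segments
    grows roughly like alpha * log(n), far below n, which is what keeps
    Dirichlet-process inference scalable on large customer bases."""
    counts = []   # customers per segment so far
    labels = []
    for _ in range(n_customers):
        # Probability of each existing segment is proportional to its size;
        # a brand-new segment gets probability proportional to alpha.
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)      # open a new segment
        else:
            counts[k] += 1        # join an existing one
        labels.append(k)
    return labels, counts

labels, counts = crp_assign(500, alpha=2.0, rng=np.random.default_rng(0))
```

Because similar customers collapse into shared segments, the model reasons about segment-to-segment interaction rates instead of all O(n²) customer pairs.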
Automating Software Customization via Crowdsourcing using Association Rule Mining and Markov Decision Processes
As systems grow in size and complexity, so do their configuration possibilities. Users of modern systems are easily confused and overwhelmed by the number of choices they need to make in order to fit their systems to their exact needs. In this thesis, we propose a technique to select what information to elicit from the user so that the system can recommend the maximum number of personalized configuration items. Our method is based on constructing configuration elicitation dialogs by utilizing crowd wisdom.
A set of configuration preferences in the form of association rules is first mined from a crowd configuration data set. Possible configuration elicitation dialogs are then modeled as a Markov Decision Process (MDP). Within the model, association rules are used to automatically infer configuration decisions based on knowledge already elicited earlier in the dialog. This way, an MDP solver can search for elicitation strategies that maximize the expected number of automated decisions, thereby reducing elicitation effort and increasing user confidence in the result. We conclude by reporting results of a case study in which this method is applied to the privacy configuration of Facebook.
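The rule-based inference step can be sketched as forward chaining over already-elicited settings; the MDP solver then only needs to ask about items the closure cannot reach. The setting names below are hypothetical Facebook-style options used purely for illustration.

```python
def infer_closure(elicited, rules):
    """Forward-chain association rules over elicited settings: whenever every
    item in a rule's antecedent is already known, its consequent is inferred
    for free (illustrative helper; in the thesis this inference sits inside
    an MDP whose solver picks which question to ask next)."""
    known = set(elicited)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known

# Hypothetical mined rules: choosing private photos predicts tag review, etc.
rules = [({"photos_private"}, "tagging_review"),
         ({"photos_private", "tagging_review"}, "timeline_review")]
closure = infer_closure({"photos_private"}, rules)
```

Here a single elicited answer yields two inferred settings, so the solver's objective, maximizing expected automated decisions, would favor asking about `photos_private` first.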
MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics
Transfer reinforcement learning (RL) aims at improving the learning
efficiency of an agent by exploiting knowledge from other source agents trained
on relevant tasks. However, it remains challenging to transfer knowledge
between different environmental dynamics without having access to the source
environments. In this work, we explore a new challenge in transfer RL, where
only a set of source policies collected under diverse unknown dynamics is
available for learning a target task efficiently. To address this problem, the
proposed approach, MULTI-source POLicy AggRegation (MULTIPOLAR), comprises two
key techniques. We learn to aggregate the actions provided by the source
policies adaptively to maximize the target task performance. Meanwhile, we
learn an auxiliary network that predicts residuals around the aggregated
actions, which ensures the target policy's expressiveness even when some of the
source policies perform poorly. We demonstrated the effectiveness of MULTIPOLAR
through an extensive experimental evaluation across six simulated environments
ranging from classic control problems to challenging robotics simulations,
under both continuous and discrete action spaces. The demo videos and code are
available on the project webpage: https://omron-sinicx.github.io/multipolar/.
Comment: This work was presented at IJCAI 2020. Copyright (c) 2020
International Joint Conferences on Artificial Intelligence, all rights
reserved.
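The two components described above, adaptive aggregation of source-policy actions plus a learned residual, can be sketched for continuous actions as follows. The function names and the linear toy policies are assumptions, with trained parameters replaced by fixed stand-ins.

```python
import numpy as np

def multipolar_action(state, source_policies, weights, residual):
    """Sketch of the two MULTIPOLAR components for continuous actions:
    a learnable weighted aggregation of source-policy actions, plus a
    learned residual (the auxiliary network) that restores expressiveness
    when the sources are poor. `weights` and `residual` stand in for
    trained parameters."""
    src_actions = np.stack([pi(state) for pi in source_policies])  # (K, act_dim)
    aggregated = (weights[:, None] * src_actions).sum(axis=0)
    return aggregated + residual(state)

# Toy: two linear source policies, learned weights, and a zero residual.
sources = [lambda s: 2.0 * s, lambda s: -1.0 * s]
w = np.array([0.75, 0.25])
act = multipolar_action(np.array([1.0]), sources, w,
                        residual=lambda s: 0.0 * s)
```

If every source policy were useless, the aggregation weights could be driven toward zero and the residual alone would carry the target policy, which is the expressiveness guarantee the abstract mentions.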