Optimistic Exploration even with a Pessimistic Initialisation
Optimistic initialisation is an effective strategy for efficient exploration
in reinforcement learning (RL). In the tabular case, all provably efficient
model-free algorithms rely on it. However, model-free deep RL algorithms do not
use optimistic initialisation despite taking inspiration from these provably
efficient tabular algorithms. In particular, in scenarios with only positive
rewards, Q-values are initialised at their lowest possible values due to
commonly used network initialisation schemes, a pessimistic initialisation.
Merely initialising the network to output optimistic Q-values is not enough,
since we cannot ensure that they remain optimistic for novel state-action
pairs, which is crucial for exploration. We propose a simple count-based
augmentation to pessimistically initialised Q-values that separates the source
of optimism from the neural network. We show that this scheme is provably
efficient in the tabular setting and extend it to the deep RL setting. Our
algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments
the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure
optimism during both action selection and bootstrapping. We show that OPIQ
outperforms non-optimistic DQN variants that utilise a pseudocount-based
intrinsic motivation in hard exploration tasks, and that it predicts optimistic
estimates for novel state-action pairs.
Comment: Published as a conference paper at ICLR 202
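The count-based augmentation described above can be sketched in the tabular setting. The bonus shape C/(N(s,a)+1)^M, the hyperparameter values, and the environment interface below are illustrative assumptions, not the paper's exact formulation:

```python
from collections import defaultdict
import numpy as np

class OPIQAgent:
    """Sketch of count-based optimism on top of pessimistically
    initialised (zero) Q-values, in the spirit of OPIQ."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, C=1.0, M=2.0):
        self.q = defaultdict(lambda: np.zeros(n_actions))       # pessimistic init
        self.counts = defaultdict(lambda: np.zeros(n_actions))  # visit counts N(s, a)
        self.alpha, self.gamma, self.C, self.M = alpha, gamma, C, M

    def optimistic_q(self, s):
        # Count-derived bonus: large for novel pairs, shrinks with visits,
        # so optimism is separated from the network/table itself.
        return self.q[s] + self.C / (self.counts[s] + 1) ** self.M

    def act(self, s):
        # Optimism during action selection.
        return int(np.argmax(self.optimistic_q(s)))

    def update(self, s, a, r, s_next, done):
        self.counts[s][a] += 1
        # Optimism during bootstrapping: the target also uses augmented values.
        target = r if done else r + self.gamma * np.max(self.optimistic_q(s_next))
        self.q[s][a] += self.alpha * (target - self.q[s][a])
```

With zero-initialised Q-values, an unvisited pair still scores C = 1.0 under `optimistic_q`, which is what drives exploration toward novel state-action pairs.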
Softmax exploration strategies for multiobjective reinforcement learning
Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies in the multiobjective context. This work considers several widely used approaches to exploration from the single-objective reinforcement learning literature and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks, which impact the performance of the exploration strategies, are identified. It is shown that, of the techniques considered, the combination of the novel softmax–epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.
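One plausible reading of softmax exploration over vector-valued Q-values is to scalarise the vectors with preference weights and then apply a Boltzmann step; the exact softmax-epsilon variant proposed in the paper may differ, and the weights and temperature below are assumptions for illustration:

```python
import numpy as np

def scalarise(q_vectors, w):
    # q_vectors: (n_actions, n_objectives); w: preference weights over objectives.
    return q_vectors @ w

def softmax_epsilon_action(q_vectors, w, tau=1.0, epsilon=0.1, rng=None):
    """With probability epsilon, sample from a softmax (Boltzmann)
    distribution over scalarised values; otherwise act greedily."""
    if rng is None:
        rng = np.random.default_rng()
    q = scalarise(q_vectors, w)
    if rng.random() < epsilon:
        z = (q - q.max()) / tau              # subtract max for numerical stability
        p = np.exp(z) / np.exp(z).sum()      # softmax probabilities over actions
        return int(rng.choice(len(q), p=p))
    return int(np.argmax(q))
```

The scalarisation step is where the multiobjective formulation enters: the softmax itself only ever sees a single scalar per action.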
Cautious Reinforcement Learning with Logical Constraints
This paper presents the concept of an adaptive safe padding that forces
Reinforcement Learning (RL) to synthesise optimal control policies while
ensuring safety during the learning process. Policies are synthesised to
satisfy a goal, expressed as a temporal logic formula, with maximal
probability. Enforcing the RL agent to stay safe during learning might limit
the exploration, however we show that the proposed architecture is able to
automatically handle the trade-off between efficient progress in exploration
(towards goal satisfaction) and ensuring safety. Theoretical guarantees are
available on the optimality of the synthesised policies and on the convergence
of the learning algorithm. Experimental results are provided to showcase the
performance of the proposed method.
Comment: Accepted to AAMAS 2020. arXiv admin note: text overlap with arXiv:1902.0077
Essays in economics and machine learning
This thesis studies questions in innovation, optimal policy, and aggregate fluctuations, partially with the help of methods from machine learning and artificial intelligence.
Chapter 1 is concerned with innovation in patents. With tools from natural language processing, we represent around 4.6 million patents as high dimensional numerical vectors and find a rich underlying structure. We measure economy wide and field specific trends and detect patents which anticipated such trends or widened existing ideas. These patents on average have higher citations and their firms tend to make higher profits.
Chapter 2 discusses an application of reinforcement learning to outcomes from causal experiments in economics. We model individuals who lost their jobs and arrive sequentially at a policy maker’s office to register as unemployed. After paying a cost to provide job training to an individual, the policy maker observes a treatment effect estimate which we obtain from RCT data. Due to a limited budget, she cannot provide training to all individuals. We use reinforcement learning to solve for the policy function in this dynamic programming problem.
Chapter 3 turns to the analysis of macroeconomic fluctuations. It introduces a mechanism through which perpetual cycles in aggregate output can result endogenously. Individuals share sentiments similarly to diseases in models of disease transmission. Consumption of optimistic consumers is biased upwards and consumption of pessimistic consumers downwards. In a behavioural New Keynesian model, recurring waves of optimism and pessimism lead to cyclical aggregate output as an inherent feature of this economy.
Chapter 4 concludes with a brief empirical investigation of newspaper sentiments and how they fluctuate relative to aggregate economic variables. Here the focus is not on contagion, but on the measurement of aggregate business and economic sentiment since around 1850. Using the archive of the New York Times, I build a historical indicator, discuss its properties, and suggest possible extensions.
Warm-up strategies in multi-armed bandits for recommendation (Estrategias de calentamiento en bandidos multi-brazo para recomendación)
Master's Thesis in Research and Innovation in Computational Intelligence and Interactive Systems.
Recommender systems have become an essential part of many online platforms, such as streaming services and e-commerce sites, in recent years, as they provide users with items they may find interesting, granting them a personalised experience. The recommendation problem has many open lines of investigation. One of them is the topic we tackle in this work: the cold-start problem.
In the context of recommender systems, the cold-start problem refers to the situation in which a system does not have enough information to give proper suggestions to the user. The cold-start problem typically arises for one of three main reasons: the user to be recommended is new to the system, so there is no information about their tastes; some of the recommended items have been recently added to the system and have no user reviews; or the system is completely new and there is no information about either the users or the items.
Classical recommendation techniques come from machine learning and treat recommendation as a static process in which the system provides suggestions and the user rates them. It is more fruitful to understand recommendation as a cycle of constant interaction between the user and the system: every time a user rates an item, the system uses that feedback to learn about the user. In that sense, we can sacrifice immediate reward in order to gain information about the user and improve long-term reward. This scheme establishes a balance between exploration (non-optimal recommendations to learn about the user) and exploitation (optimal recommendations to maximise the reward). Techniques known as multi-armed bandits are used to strike that balance between exploration and exploitation, and we propose them to tackle the cold-start problem.
Our hypothesis is that exploration in the first epochs of the recommendation cycle can lead to an improvement in the reward during the later epochs. To test this hypothesis, we divide the recommendation loop into two phases: the warm-up, in which we follow a more exploratory approach to gather as much information as possible; and exploitation, in which the system uses the knowledge acquired during the warm-up to maximise the reward. For these two phases we combine different recommendation strategies, among which we consider both multi-armed bandits and classical algorithms. We evaluate them offline on three datasets: CM100K (music), MovieLens1M (films), and Twitter. We also study how the warm-up duration affects the exploitation phase. Results show that on two datasets (MovieLens and Twitter), classical algorithms perform better during the exploitation phase in terms of recall after a mainly exploratory warm-up phase.
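The two-phase scheme described above can be sketched as a bandit that is epsilon-greedy during the warm-up and purely greedy afterwards. The phase length, the epsilon value, and the use of epsilon-greedy as the exploratory strategy are illustrative assumptions, not the thesis's exact configuration:

```python
import numpy as np

class TwoPhaseBandit:
    """Exploratory warm-up phase followed by a greedy exploitation phase."""

    def __init__(self, n_arms, warmup_steps, epsilon=0.3, rng=None):
        self.values = np.zeros(n_arms)   # running mean reward per arm (item)
        self.counts = np.zeros(n_arms)
        self.warmup_steps = warmup_steps
        self.epsilon = epsilon
        self.t = 0
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def select(self):
        in_warmup = self.t < self.warmup_steps
        if in_warmup and self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.values)))  # explore
        return int(np.argmax(self.values))                   # exploit

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental mean update of the arm's estimated reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Once `t` passes `warmup_steps`, the agent recommends only the empirically best items, which is precisely where a well-spent warm-up budget should pay off.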