
    Optimistic Exploration even with a Pessimistic Initialisation

    Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs. Comment: Published as a conference paper at ICLR 2020.
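
    As an illustration of the count-based augmentation described above, the following is a minimal tabular sketch, not the paper's deep RL implementation: a pessimistically initialised Q-table receives an optimism bonus that decays with the visit count, and the augmented values are used both for action selection and for bootstrapping. The bonus form C / (N(s, a) + 1)^M and the hyperparameter names are assumptions made for illustration.

```python
import numpy as np

class CountAugmentedQ:
    """Tabular sketch: pessimistically initialised Q-values plus a count-based
    optimism bonus, so unvisited state-action pairs look optimistic."""

    def __init__(self, n_states, n_actions, c=1.0, m=2.0, gamma=0.99, lr=0.1):
        self.q = np.zeros((n_states, n_actions))   # pessimistic init (all rewards >= 0)
        self.n = np.zeros((n_states, n_actions))   # visit counts
        self.c, self.m, self.gamma, self.lr = c, m, gamma, lr

    def q_plus(self, s):
        # augmented Q-values: the bonus shrinks as the visit count grows
        return self.q[s] + self.c / (self.n[s] + 1.0) ** self.m

    def act(self, s):
        # action selection uses the augmented (optimistic) values
        return int(np.argmax(self.q_plus(s)))

    def update(self, s, a, r, s_next, done):
        self.n[s, a] += 1
        # bootstrapping also uses the augmented values of the next state
        target = r if done else r + self.gamma * np.max(self.q_plus(s_next))
        self.q[s, a] += self.lr * (target - self.q[s, a])
```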

    Softmax exploration strategies for multiobjective reinforcement learning

    Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely used approaches to exploration from the single-objective reinforcement learning literature and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks which impact the performance of the exploration strategies are identified. It is shown that of the techniques considered, the combination of the novel softmax–epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.
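
    The abstract does not spell out the two vector-valued softmax operators, so the sketch below only illustrates the generic ingredient they build on: Boltzmann (softmax) action selection over multiobjective Q-estimates, here combined via a simple linear scalarisation. The weights, temperature, and scalarisation choice are assumptions, not the paper's method.

```python
import numpy as np

def softmax_action(q_vectors, weights, temperature=1.0, rng=None):
    """Boltzmann (softmax) exploration over vector-valued Q-estimates.

    q_vectors: array of shape (n_actions, n_objectives)
    weights:   preference weights used to scalarise the objectives
    """
    if rng is None:
        rng = np.random.default_rng()
    q_vectors = np.asarray(q_vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scalarised = q_vectors @ weights            # simple linear scalarisation
    prefs = scalarised / temperature
    prefs -= prefs.max()                        # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))

# example: 3 actions, 2 objectives, equal preference weights
q = [[1.0, 0.2], [0.5, 0.9], [0.1, 0.1]]
a = softmax_action(q, weights=[0.5, 0.5], temperature=0.5)
```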

    Cautious Reinforcement Learning with Logical Constraints

    This paper presents the concept of an adaptive safe padding that forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process. Policies are synthesised to satisfy a goal, expressed as a temporal logic formula, with maximal probability. Forcing the RL agent to stay safe during learning might limit exploration; however, we show that the proposed architecture is able to automatically handle the trade-off between efficient progress in exploration (towards goal satisfaction) and ensuring safety. Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm. Experimental results are provided to showcase the performance of the proposed method. Comment: Accepted to AAMAS 2020. arXiv admin note: text overlap with arXiv:1902.0077
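
    The padding construction itself (built from the temporal logic specification) is not detailed in the abstract, so the following is only a generic, hedged illustration of the underlying idea of restricting exploration to an estimated safe action set; the thresholding rule and all names are assumptions rather than the paper's automaton-based architecture.

```python
import numpy as np

def safe_epsilon_greedy(q_values, safety_estimates, threshold, epsilon=0.1, rng=None):
    """Epsilon-greedy action selection restricted to actions whose estimated
    probability of remaining safe is at least `threshold`; if no action
    qualifies, fall back to the safest available one."""
    if rng is None:
        rng = np.random.default_rng()
    q_values = np.asarray(q_values, dtype=float)
    safety_estimates = np.asarray(safety_estimates, dtype=float)
    safe = np.flatnonzero(safety_estimates >= threshold)
    if safe.size == 0:
        return int(np.argmax(safety_estimates))      # no safe action: pick the safest
    if rng.random() < epsilon:
        return int(rng.choice(safe))                 # explore within the safe set
    return int(safe[np.argmax(q_values[safe])])      # exploit within the safe set

# example: action 2 has the highest Q-value but is estimated unsafe
a = safe_epsilon_greedy(q_values=[0.2, 0.5, 0.9],
                        safety_estimates=[0.99, 0.95, 0.40],
                        threshold=0.9)
```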

    Essays in economics and machine learning

    This thesis studies questions in innovation, optimal policy, and aggregate fluctuations, partially with the help of methods from machine learning and artificial intelligence. Chapter 1 is concerned with innovation in patents. With tools from natural language processing, we represent around 4.6 million patents as high-dimensional numerical vectors and find a rich underlying structure. We measure economy-wide and field-specific trends and detect patents which anticipated such trends or widened existing ideas. These patents on average have higher citation counts and their firms tend to make higher profits. Chapter 2 discusses an application of reinforcement learning to outcomes from causal experiments in economics. We model individuals who lost their jobs and arrive sequentially at a policy maker’s office to register as unemployed. After paying a cost to provide job training to an individual, the policy maker observes a treatment effect estimate which we obtain from RCT data. Due to a limited budget, she cannot provide training to all individuals. We use reinforcement learning to solve for the policy function in this dynamic programming problem. Chapter 3 turns to the analysis of macroeconomic fluctuations. It introduces a mechanism through which perpetual cycles in aggregate output can arise endogenously. Individuals share sentiments much as diseases spread in models of disease transmission. Consumption of optimistic consumers is biased upwards and consumption of pessimistic consumers downwards. In a behavioural New Keynesian model, recurring waves of optimism and pessimism lead to cyclical aggregate output as an inherent feature of this economy. Chapter 4 concludes with a brief empirical investigation of newspaper sentiment and how it fluctuates relative to aggregate economic variables. Here the focus is not on contagion but on the measurement of aggregate business and economic sentiment since around 1850. Using the archive of the New York Times, I build a historical indicator, discuss its properties, and outline possible extensions.

    Warm-up strategies in multi-armed bandits for recommendation

    Master's thesis in Research and Innovation in Computational Intelligence and Interactive Systems. Recommender systems have become an essential piece of multiple online platforms, such as streaming services and e-commerce, in recent years, as they provide users with articles they may find interesting and thus grant them a personalised experience. The recommendation problem has many open lines of investigation. One of them is the topic we tackle in this work: the cold-start problem. In the context of recommender systems, the cold-start problem refers to the situation in which a system does not have enough information to give proper suggestions to the user. The cold-start problem often occurs for three main reasons: the user to be recommended is new to the system, so there is no information about their preferences; some of the recommended items have been recently added to the system and do not yet have user reviews; or the system is completely new and there is no information about either the users or the items. Classical recommendation techniques come from machine learning and treat recommendation as a static process in which the system provides suggestions to the user and the latter rates them. It is more convenient to understand recommendation as a cycle of constant interaction between the user and the system: every time a user rates an item, the system uses that feedback to learn about the user. In that sense, we can sacrifice immediate reward in order to gain information about the user and improve long-term reward. This scheme establishes a balance between exploration (non-optimal recommendations to learn about the user) and exploitation (optimal recommendations to maximise the reward). Techniques known as multi-armed bandits are used to strike that balance between exploration and exploitation, and we propose using them to tackle the cold-start problem. Our hypothesis is that exploration in the first epochs of the recommendation cycle can improve the reward obtained in later epochs. To test this hypothesis, we divide the recommendation loop into two phases: the warm-up, in which we follow a more exploratory approach to gather as much information as possible, and exploitation, in which the system uses the knowledge acquired during the warm-up to maximise the reward. For these two phases we combine different recommendation strategies, among which we consider both multi-armed bandits and classic algorithms. We evaluate them offline on three datasets: CM100K (music), MovieLens1M (films) and Twitter. We also study how the warm-up duration affects the exploitation phase. Results show that in two of the datasets (MovieLens and Twitter), classical algorithms perform better during the exploitation phase in terms of recall after a mainly exploratory warm-up phase.
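
    As a minimal illustration of the warm-up/exploitation split, the sketch below simulates a purely random warm-up phase followed by a greedy exploitation phase over a small item catalogue with hidden acceptance probabilities. The reward model, phase lengths, and the greedy exploitation strategy are assumptions; the thesis compares a range of bandit and classical recommenders not reproduced here.

```python
import numpy as np

def run_two_phase_recommender(reward_probs, warmup_steps, total_steps, rng=None):
    """Simulate the two-phase loop: random (exploratory) recommendations during
    the warm-up, then greedy recommendations that exploit the collected feedback."""
    if rng is None:
        rng = np.random.default_rng()
    n_items = len(reward_probs)
    hits = np.zeros(n_items)       # positive feedback received per item
    pulls = np.zeros(n_items)      # times each item was recommended
    total_reward = 0.0
    for t in range(total_steps):
        if t < warmup_steps:
            item = int(rng.integers(n_items))              # warm-up: explore uniformly
        else:
            means = hits / np.maximum(pulls, 1)
            item = int(np.argmax(means))                   # exploitation: greedy choice
        reward = float(rng.random() < reward_probs[item])  # simulated user feedback
        hits[item] += reward
        pulls[item] += 1
        total_reward += reward
    return total_reward

# example: five items with hidden acceptance probabilities
print(run_two_phase_recommender([0.10, 0.30, 0.05, 0.60, 0.20],
                                warmup_steps=200, total_steps=2000))
```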