
    Quantum speedup for active learning agents

    Can quantum mechanics help us in building intelligent robots and agents? One of the defining characteristics of intelligent behavior is the capacity to learn from experience. However, a major bottleneck for agents to learn in any real-life situation is the size and complexity of the corresponding task environment. Owing to, e.g., a large space of possible strategies, learning is typically slow. Even for a moderate task environment, it may simply take too long to respond rationally to a given situation. If the environment is impatient, allowing only a certain time for a response, an agent may then be unable to cope with the situation and to learn at all. Here we show that quantum physics can help and provide a significant speed-up for active learning as a genuine problem of artificial intelligence. We introduce a large class of quantum learning agents for which we show a quadratic boost in their active learning efficiency over their classical analogues. This result will be particularly relevant for applications involving complex task environments. Comment: Minor updates, 14 pages, 3 figures
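
    The quadratic boost claimed here is of the same flavour as the advantage of amplitude amplification over classical trial-and-error search. As a rough, hedged illustration only (not the paper's agent construction), the Python sketch below compares the expected number of interactions a classical agent needs before it first hits a rarely rewarded response, occurring with an assumed probability p, against the roughly π/(4√p) iterations a Grover-style amplitude-amplification subroutine would need; both helper functions and the sample values of p are illustrative assumptions.

```python
import math

def classical_expected_trials(p: float) -> float:
    """Expected number of independent trials until an action that is
    rewarded with probability p first succeeds (geometric distribution)."""
    return 1.0 / p

def amplitude_amplification_iterations(p: float) -> float:
    """Approximate number of Grover-style amplitude-amplification
    iterations needed to boost a success probability p close to 1."""
    return math.pi / (4.0 * math.sqrt(p))

if __name__ == "__main__":
    # Illustrative values of p only: the quadratic gap (1/p vs ~1/sqrt(p))
    # is the point, not the absolute numbers.
    for p in (1e-2, 1e-4, 1e-6):
        print(f"p={p:g}: classical ~{classical_expected_trials(p):,.0f} trials, "
              f"amplified ~{amplitude_amplification_iterations(p):,.0f} iterations")
```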

    Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

    Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor γ < 1, or in episodic settings, with γ = 1. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent "discount" factor that is not constrained to be less than 1, so long as there is eventual long-run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique "optimizing MDP" with fixed γ < 1 whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from traditional RL, and is relevant to task specification in RL, inverse RL and preference-based RL. Comment: 8 pages + 1 page supplement. In proceedings of AAAI 2019. Slides, poster and bibtex available at https://silviupitis.com/#rethinking-the-discount-factor-in-reinforcement-learning-a-decision-theoretic-approac
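
    As a hedged sketch of the generalization the abstract describes, the snippet below runs value iteration on a toy MDP in which the discount is a state-action dependent table G[s, a] rather than a single fixed γ. The transition probabilities, rewards, and discount values are invented for illustration and are kept below 1 so the iteration contracts; the paper's actual condition is the weaker requirement of eventual long-run discounting.

```python
import numpy as np

# Toy 2-state, 2-action MDP; every number below is an illustrative assumption.
P = np.array([[[0.9, 0.1],     # P[s, a, s']: transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],      # R[s, a]: expected immediate rewards
              [0.0, 2.0]])
G = np.array([[0.95, 0.80],    # G[s, a]: state-action dependent "discount";
              [0.99, 0.90]])   # setting G[s, a] = gamma recovers fixed discounting

def value_iteration(P, R, G, iters=1000):
    """Value iteration with a state-action dependent discount G[s, a]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + G * (P @ V)    # Q[s, a] = R[s, a] + G[s, a] * E_{s'}[V(s')]
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

V, policy = value_iteration(P, R, G)
print("V* =", V, "greedy policy =", policy)
```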

    The Influence of Markov Decision Process Structure on the Possible Strategic Use of Working Memory and Episodic Memory

    Researchers use a variety of behavioral tasks to analyze the effect of biological manipulations on memory function. This research will benefit from a systematic mathematical method for analyzing memory demands in behavioral tasks. In the framework of reinforcement learning theory, these tasks can be described mathematically as partially observable Markov decision processes. While a wealth of evidence collected over the past 15 years relates the basal ganglia to the reinforcement learning framework, only recently has much attention been paid to including psychological concepts such as working memory or episodic memory in these models. This paper presents an analysis that provides a quantitative description of the memory states sufficient for correct choices at specific decision points. Using information from the mathematical structure of the task descriptions, we derive measures that indicate whether working memory (for one or more cues) or episodic memory can provide strategically useful information to an agent. In particular, the analysis determines which observed states must be maintained in or retrieved from memory to perform these specific tasks. We demonstrate the analysis on three simplified tasks as well as eight more complex memory tasks drawn from the animal and human literature (two alternation tasks, two sequence disambiguation tasks, two non-matching tasks, the 2-back task, and the 1-2-AX task). The results of these analyses agree with results from quantitative simulations of the tasks reported in previous publications, and they provide simple indications of the tasks' memory demands that require far less computation than a full simulation. This may provide a basis for a quantitative behavioral stoichiometry of memory tasks.
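
    The core of the analysis, deciding which observations must be held in memory for a correct choice, can be conveyed with a small hedged sketch. The toy spatial-alternation trials, the observation and action labels, and the action_determined_by helper below are all invented for illustration; the sketch only mirrors the idea that a decision point demands memory when the current observation alone does not determine a unique correct action, while a short observation history does.

```python
from collections import defaultdict

# Toy spatial-alternation trials (all labels are illustrative assumptions):
# the rewarded choice at the choice point is the arm NOT visited previously,
# so the current observation alone is ambiguous.
trials = [
    (("arm_A", "choice_point"), "go_B"),
    (("arm_B", "choice_point"), "go_A"),
]

def action_determined_by(trials, history_len):
    """True if the last `history_len` observations always pick out a
    unique correct action at the decision point."""
    mapping = defaultdict(set)
    for obs_seq, action in trials:
        mapping[obs_seq[-history_len:]].add(action)
    return all(len(actions) == 1 for actions in mapping.values())

print(action_determined_by(trials, history_len=1))  # False: memory is needed
print(action_determined_by(trials, history_len=2))  # True: one remembered cue suffices
```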

    Certified Reinforcement Learning with Logic Guidance

    This paper proposes the first model-free Reinforcement Learning (RL) framework to synthesise policies for unknown, continuous-state Markov Decision Processes (MDPs) such that a given linear temporal property is satisfied. We convert the given property into a Limit Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property. Exploiting the structure of the LDBA, we shape a synchronous reward function on the fly, so that an RL algorithm can synthesise a policy resulting in traces that probabilistically satisfy the linear temporal property. This probability (certificate) is also calculated in parallel with policy learning when the state space of the MDP is finite: as such, the RL algorithm produces a policy that is certified with respect to the property. Under the assumption of a finite state space, theoretical guarantees are provided on the convergence of the RL algorithm to an optimal policy maximising the above probability. We also show that our method produces "best available" control policies when the logical property cannot be satisfied. In the general case of a continuous state space, we propose a neural network architecture for RL and we empirically show that the algorithm finds satisfying policies, if such policies exist. The performance of the proposed framework is evaluated via a set of numerical examples and benchmarks, where we observe an improvement of one order of magnitude in the number of iterations required for policy synthesis compared to existing approaches, whenever available. Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
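
    A hedged sketch of the overall recipe, automaton-synchronised reward plus off-the-shelf RL, is given below. The two-state automaton, the grid-world dynamics, the labelling function, and the one-step reward scheme are all simplifying assumptions chosen for a single reachability property; the paper's on-the-fly LDBA construction, continuous-state neural architecture, and certification machinery are not reproduced here.

```python
import random
from collections import defaultdict

# Minimal sketch under simplifying assumptions: instead of building a Limit
# Deterministic Büchi Automaton from an LTL formula, we hand-code a two-state
# automaton for the reachability property "eventually reach the goal cell"
# and run tabular Q-learning on the synchronised product state (s, q).
AUTOMATON = {                  # (automaton state, label of next MDP state) -> next automaton state
    (0, "other"): 0, (0, "goal"): 1,
    (1, "other"): 1, (1, "goal"): 1,
}
ACCEPTING = {1}                # the shaped reward tracks visits to this set

N_CELLS = 6

def label(s):
    """Atomic-proposition label of an MDP state (illustrative)."""
    return "goal" if s == N_CELLS - 1 else "other"

def step(s, a):
    """1-D grid dynamics: action 0 moves left, action 1 moves right."""
    return max(0, min(N_CELLS - 1, s + (1 if a == 1 else -1)))

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.99, 0.2
for _ in range(2000):                      # Q-learning episodes on the product (s, q)
    s, q = 0, 0
    for _ in range(50):
        a = random.randrange(2) if random.random() < eps \
            else max((0, 1), key=lambda act: Q[(s, q, act)])
        s2 = step(s, a)
        q2 = AUTOMATON[(q, label(s2))]
        r = 1.0 if q2 in ACCEPTING else 0.0          # reward tied to the automaton, not the MDP
        best_next = max(Q[(s2, q2, 0)], Q[(s2, q2, 1)])
        Q[(s, q, a)] += alpha * (r + gamma * best_next - Q[(s, q, a)])
        s, q = s2, q2

greedy = [max((0, 1), key=lambda act: Q[(s, 0, act)]) for s in range(N_CELLS)]
print("greedy actions per cell (automaton state 0):", greedy)
```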