
    Reinforcement Learning Architectures: SAC, TAC, and ESAC

    The trend is to implement intelligent agents capable of analyzing the available information and utilizing it efficiently. This work presents a number of reinforcement learning (RL) architectures, one of which is designed for intelligent agents. The proposed architectures are called selector-actor-critic (SAC), tuner-actor-critic (TAC), and estimator-selector-actor-critic (ESAC). These architectures are improved models of the well-known actor-critic (AC) architecture in RL. In AC, an actor optimizes the policy in use, while a critic estimates a value function and evaluates the policy optimized by the actor. SAC is an architecture equipped with an actor, a critic, and a selector. The selector determines the most promising action at the current state based on the last estimate from the critic. TAC consists of a tuner, a model-learner, an actor, and a critic. After receiving the approximated value of the current state-action pair from the critic and the learned model from the model-learner, the tuner uses the Bellman equation to tune the value of the current state-action pair. ESAC is proposed to implement intelligent agents based on two ideas: lookahead and intuition. Lookahead appears in estimating the values of the available actions at the next state, while intuition appears in maximizing the probability of selecting the most promising action. The elements newly added in ESAC are a model-learner, an estimator, and a selector. The model-learner is used to approximate the underlying model. The estimator uses the approximated value function, the learned model, and the Bellman equation to estimate the values of all actions at the next state. The selector determines the most promising action at the next state, which is then used by the actor to optimize the policy. Finally, the results show the superiority of ESAC compared with the other architectures.
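
    To make the estimator and selector steps concrete, the sketch below shows one tabular reading of the lookahead described above; the variable names, tabular setting, and discount factor are illustrative assumptions, not details taken from the paper.

        import numpy as np

        # Illustrative tabular sketch of ESAC's estimator/selector step (assumed setting).
        n_states, n_actions = 10, 4
        Q = np.zeros((n_states, n_actions))        # critic's approximated action values
        P_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # learned model
        gamma = 0.95                               # assumed discount factor

        def next_state_action_values(s, a):
            """Estimator: expected value of each action at the next state,
            combining the learned model P_hat with the critic's estimates Q."""
            return P_hat[s, a] @ Q                 # shape (n_actions,)

        def select_most_promising(s, a):
            """Selector: the next-state action the actor should favour."""
            return int(np.argmax(next_state_action_values(s, a)))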

    Deep Brain Stimulation Programming 2.0: Future Perspectives for Target Identification and Adaptive Closed Loop Stimulation

    Deep brain stimulation (DBS) has developed into an established treatment for movement disorders and is being actively investigated for numerous other neurological as well as psychiatric disorders. Accurate electrode placement in the target area and effective programming of DBS devices are considered the most important factors for the individual outcome. Recent research in humans highlights the relevance of widespread networks connected to specific DBS targets. Improving the targeting of the anatomical and functional networks involved in the generation of pathological neural activity will improve the clinical DBS effect and limit side effects. Here, we offer a comprehensive overview of the latest research on target structures and targeting strategies in DBS. In addition, we provide a detailed synopsis of novel technologies that will support DBS programming and parameter selection in the future, with a particular focus on closed-loop stimulation and associated biofeedback signals.

    Online implementation of a soft actor-critic agent to enhance indoor temperature control and energy efficiency in buildings

    Recently, growing interest has been observed in HVAC control systems based on Artificial Intelligence, which aim to improve comfort conditions while avoiding unnecessary energy consumption. In this work, a model-free algorithm belonging to the Deep Reinforcement Learning (DRL) class, Soft Actor-Critic, was implemented to control the supply water temperature to the radiant terminal units of a heating system serving an office building. The controller was trained online, and a preliminary sensitivity analysis on hyperparameters was performed to assess their influence on agent performance. The DRL agent with the best performance was compared to a rule-based controller, taken as the baseline, over a three-month heating season. The DRL controller outperformed the baseline after two weeks of deployment, with an overall improvement in the control of indoor temperature conditions. Moreover, the adaptability of the DRL agent was tested in various control scenarios, simulating changes in external weather conditions, indoor temperature setpoint, building envelope features, and occupancy patterns. Despite a slight increase in energy consumption, the dynamically deployed agent improved indoor temperature control, reducing the cumulative sum of temperature violations, on average across all scenarios, by 75% and 48% compared with the baseline and the statically deployed agent, respectively.
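
    As a rough illustration of the online deployment described above, the sketch below shows a generic act-store-update loop with a comfort-and-energy reward; the environment and agent interfaces, the comfort band, and the energy weight are assumptions made for illustration and are not taken from the paper.

        # Hypothetical online Soft Actor-Critic deployment loop (illustrative only).
        def reward(t_indoor, t_set, energy_kwh, beta=0.1):
            """Penalise indoor temperature violations plus weighted energy use."""
            violation = max(0.0, abs(t_indoor - t_set) - 0.5)   # assumed +/-0.5 C comfort band
            return -(violation + beta * energy_kwh)

        def run_online(env, agent, n_steps=1000):
            """At each control step: act, observe, store the transition, update the agent."""
            obs = env.reset()
            for _ in range(n_steps):
                action = agent.act(obs)                          # supply water temperature setpoint
                next_obs, t_indoor, t_set, energy = env.step(action)
                r = reward(t_indoor, t_set, energy)
                agent.store(obs, action, r, next_obs)
                agent.update()                                   # one SAC gradient update
                obs = next_obs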

    Selector-Actor-Critic and Tuner-Actor-Critic Algorithms for Reinforcement Learning

    This work presents two reinforcement learning (RL) architectures, which mimic the way rational humans analyze the available information and make decisions. The proposed algorithms are called selector-actor-critic (SAC) and tuner-actor-critic (TAC). They are obtained by modifying the well-known actor-critic (AC) algorithm. SAC is equipped with an actor, a critic, and a selector. The role of the selector is to determine the most promising action at the current state based on the last estimate from the critic. TAC is model-based and consists of a tuner, a model-learner, an actor, and a critic. After receiving the approximated value of the current state-action pair from the critic and the learned model from the model-learner, the tuner uses the Bellman equation to tune the value of the current state-action pair. This tuned value is then used by the actor to optimize the policy. We investigate the performance of the proposed algorithms and compare them with the AC algorithm, using numerical simulations to show their advantages.
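
    As a concrete reading of the tuner step, the sketch below performs one Bellman backup for the current state-action pair in a tabular setting; the tabular formulation, variable names, and discount factor are illustrative assumptions rather than the paper's exact formulation.

        import numpy as np

        gamma = 0.95   # assumed discount factor

        def tune(Q, R_hat, P_hat, s, a):
            """One Bellman backup for the current state-action pair (s, a).

            Q:     critic's action-value table,  shape (n_states, n_actions)
            R_hat: learned expected rewards,     shape (n_states, n_actions)
            P_hat: learned transition model,     shape (n_states, n_actions, n_states)
            """
            expected_next_value = P_hat[s, a] @ Q.max(axis=1)
            return R_hat[s, a] + gamma * expected_next_value   # tuned value handed to the actor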

    Enhancing the performance of energy harvesting wireless communications using optimization and machine learning

    The motivation behind this thesis is to provide efficient solutions for energy harvesting communications. Firstly, an energy harvesting underlay cognitive radio relaying network is investigated. In this context, the secondary network is an energy harvesting network. Closed-form expressions are derived for the transmission powers of the secondary source and relay that maximize the secondary network throughput. Secondly, a practical scenario in terms of the information available about the environment is investigated. We consider a communications system with a source capable of harvesting solar energy. Two cases are considered, based on the knowledge available about the underlying processes. When this knowledge is available, an algorithm using it is designed to maximize the expected throughput while reducing the complexity of traditional methods. In the second case, when knowledge about the underlying processes is unavailable, reinforcement learning is used. Thirdly, a number of learning architectures for reinforcement learning are introduced. They are called selector-actor-critic, tuner-actor-critic, and estimator-selector-actor-critic. The goal of the selector-actor-critic architecture is to increase the speed and efficiency of learning an optimal policy by approximating the most promising action at the current state. The tuner-actor-critic aims at improving the learning process by providing the actor with a more accurate estimate of the value function. The estimator-selector-actor-critic is introduced to support intelligent agents; this architecture mimics the way rational humans analyze available information and make decisions. An energy harvesting communications system operating in an unknown environment is then evaluated when supported by the proposed architectures. Fourthly, a realistic energy harvesting communications system is investigated, in which the state and action spaces of the underlying Markov decision process are continuous. Actor-critic is used to optimize the system performance: the critic uses a neural network to approximate the action-value function, and the actor uses policy gradient to optimize the policy's parameters to maximize the throughput.
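
    The sketch below illustrates the kind of actor-critic setup described in the fourth part: a neural-network critic approximating the action-value function and a policy-gradient actor for a continuous action. Network sizes, dimensions, learning rates, and the Gaussian policy are assumptions made for illustration, not the thesis's actual implementation.

        import torch
        import torch.nn as nn

        state_dim, action_dim = 4, 1

        # Critic: Q(s, a) approximated by a small neural network.
        critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        # Actor: Gaussian policy with a learned mean and a state-independent log std.
        actor_mean = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
        log_std = nn.Parameter(torch.zeros(action_dim))

        critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
        actor_opt = torch.optim.Adam(list(actor_mean.parameters()) + [log_std], lr=1e-3)

        def update(state, action, reward, next_state, next_action, gamma=0.99):
            """One TD(0) critic update and one policy-gradient actor update."""
            q = critic(torch.cat([state, action], dim=-1))
            with torch.no_grad():
                target = reward + gamma * critic(torch.cat([next_state, next_action], dim=-1))
            critic_loss = (q - target).pow(2).mean()
            critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

            dist = torch.distributions.Normal(actor_mean(state), log_std.exp())
            advantage = (target - q).detach().squeeze(-1)      # TD error as the advantage signal
            actor_loss = -(dist.log_prob(action).sum(-1) * advantage).mean()
            actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()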

    Creación de tareas competitivas multi-agente en Minecraft (Creating competitive multi-agent tasks in Minecraft)

    The aim of this work is to create competitive multi-agent tasks in the world of Minecraft. To do so, the Project Malmo platform is used, which allows environments to be created in this world and agents to be launched with the objective of completing a mission. The scenarios are defined in XML, and the task dynamics and the algorithms are written in Python. The Q-Learning algorithm is then implemented and adapted to each of the tasks. In addition, fixed-strategy algorithms are developed so that they can be compared with Q-Learning. These comparisons demonstrate the benefit of machine learning algorithms over fixed-strategy ones and confirm the correct implementation of the tasks. Martínez Pedrón, CJ. (2018). Creación de tareas competitivas multi-agente en Minecraft. http://hdl.handle.net/10251/111663 (TFG)
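
    The sketch below shows the kind of tabular Q-Learning update that can be adapted to each Malmo task; the action set, state encoding, and hyperparameters are illustrative assumptions rather than the values used in the work.

        import random
        from collections import defaultdict

        alpha, gamma, epsilon = 0.1, 0.95, 0.1             # assumed hyperparameters
        actions = ["move 1", "turn 1", "turn -1"]          # example Malmo-style commands
        Q = defaultdict(float)                             # Q[(state, action)], default 0

        def choose_action(state):
            """Epsilon-greedy action selection over the tabular Q values."""
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        def q_update(state, action, reward, next_state):
            """One Q-Learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])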

    Reinforcement Learning approaches to hippocampus-dependent flexible spatial navigation

    Humans and non-human animals show great flexibility in spatial navigation, including the ability to return to specific locations based on as few as one single experience. To study spatial navigation in the laboratory, watermaze tasks, in which rats have to find a hidden platform in a pool of cloudy water surrounded by spatial cues, have long been used. Analogous tasks have been developed for human participants using virtual environments. Spatial learning in the watermaze is facilitated by the hippocampus. In particular, rapid, one-trial, allocentric place learning, as measured in the Delayed-Matching-to-Place (DMP) variant of the watermaze task, which requires rodents to repeatedly learn new locations in a familiar environment, is hippocampus-dependent. In this article, we review some computational principles, embedded within a Reinforcement Learning (RL) framework, that utilise hippocampal spatial representations for navigation in watermaze tasks. We consider which key elements underlie their efficacy, and discuss their limitations in accounting for hippocampus-dependent navigation, both in terms of behavioural performance (i.e., how well they reproduce behavioural measures of rapid place learning) and neurobiological realism (i.e., how well they map onto the neurobiological substrates involved in rapid place learning). We discuss how an actor-critic architecture, enabling simultaneous assessment of the value of the current location and of the optimal direction to follow, can reproduce one-trial place learning performance, as shown on watermaze and virtual DMP tasks by rats and humans, respectively, if complemented with map-like place representations. The contribution of actor-critic mechanisms to DMP performance is consistent with neurobiological findings implicating the striatum and hippocampo-striatal interaction in DMP performance, given that the striatum has been associated with actor-critic mechanisms. Moreover, we illustrate that hierarchical computations embedded within an actor-critic architecture may help to account for aspects of flexible spatial navigation. The hierarchical RL approach separates trajectory control, via a temporal-difference error, from goal selection, via a goal prediction error, and may account for flexible, trial-specific navigation to familiar goal locations, as required in some arm-maze place memory tasks, although it does not capture one-trial learning of new goal locations, as observed in open-field DMP tasks, including watermaze and virtual DMP tasks. Future models of one-shot learning of new goal locations, as observed on DMP tasks, should incorporate hippocampal plasticity mechanisms that integrate new goal information with allocentric place representations, as such mechanisms are supported by substantial empirical evidence.
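
    To make the actor-critic with map-like place representations concrete, the sketch below uses Gaussian place-cell features with a linear critic and a directional actor; the number of cells, field width, learning rate, and discount factor are illustrative assumptions, not parameters taken from the models reviewed.

        import numpy as np

        n_cells = 100
        centres = np.random.uniform(0, 1, (n_cells, 2))   # place-field centres in a unit arena
        sigma = 0.1                                        # assumed place-field width

        def place_features(pos):
            """Gaussian place-cell activations for a 2-D position."""
            d2 = ((centres - np.asarray(pos)) ** 2).sum(axis=1)
            return np.exp(-d2 / (2 * sigma ** 2))

        w_critic = np.zeros(n_cells)                       # value weights (critic)
        w_actor = np.zeros((8, n_cells))                   # preferences for 8 movement directions (actor)
        alpha, gamma = 0.1, 0.98

        def td_step(pos, action, reward, next_pos):
            """One temporal-difference update of the value and the action preferences."""
            phi, phi_next = place_features(pos), place_features(next_pos)
            delta = reward + gamma * w_critic @ phi_next - w_critic @ phi   # TD error
            w_critic += alpha * delta * phi
            w_actor[action] += alpha * delta * phi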

    A computational approach to motivated behaviour and apathy

    The loss of motivation and goal-directed behaviour is characteristic of apathy. Across a wide range of neuropsychiatric disorders, including Huntington's disease (HD), apathy is poorly understood, associated with significant morbidity, and hard to treat. One of the challenges in understanding the neural basis of apathy is moving from phenomenology and behavioural dysfunction to neural circuits in a principled manner. The computational framework offers one such approach. I adopt this framework to better understand motivated behaviour and apathy in four complementary projects. At the heart of many formulations of apathy is impaired self-initiation of goal-directed behaviour. An influential computational theory proposes that "opportunity cost", the amount of reward we stand to lose per unit time by not taking actions, is a key variable governing the timing of self-initiated behaviour. Using a novel task, I found that free-operant behaviour in healthy participants, both in laboratory conditions and in online testing, conforms to the predictions of this computational model. Furthermore, in both studies I found that in younger adults sensitivity to opportunity cost predicted behavioural apathy scores. Similar pilot results were found in a cohort of patients with HD. These data suggest that opportunity cost may be an important computational variable for understanding a core feature of apathy: the timing of self-initiated behaviour. In my second project, I used a reinforcement learning paradigm to probe for early dysfunction in a cohort of HD gene carriers approximately 25 years from clinical onset. Based on empirical data and computational models of basal ganglia function, I predicted that asymmetry in learning from gains and losses may be an early feature of carrying the HD gene. As predicted, in this task-fMRI study HD gene carriers demonstrated an exaggerated neural response to gains as compared to losses. Gene carriers also differed in the neural response to expected value, suggesting that carrying the HD gene is associated with altered processing of valence and value decades from onset. Finally, based on neurocomputational models of basal ganglia pathway function, I tested the hypothesis that apathy in HD would be associated with involvement of the direct pathway. Support for this hypothesis was found in two related projects. Firstly, using data from a large international HD cohort study, I found that apathy was associated with motor features of the disease thought to represent direct pathway involvement. Secondly, I tested this hypothesis in vivo using resting-state fMRI data and a model of basal ganglia connectivity in a large peri-manifest HD cohort. In keeping with my predictions, whilst emerging motor signs were associated with changes in the indirect pathway, apathy scores were associated with changes in direct pathway connectivity within my model. For patients with apathy across neuropsychiatry, there is an urgent need to understand the neural basis of motivated behaviour in order to develop novel therapies. In this thesis, I have used a computational framework to develop and test a range of hypotheses to advance this understanding. In particular, I have focussed on the computational factors which drive us to self-initiate, their potential neural underpinnings, and the relevance of these models for apathy in patients with HD. The data I present support the hypothesis that opportunity cost and basal ganglia pathway connectivity may be two important components necessary to generate motivated behaviour and contribute to the development of apathy in HD.
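
    As a toy illustration of the opportunity-cost idea, the sketch below trades a payoff that improves with preparation time against the reward forgone per unit of waiting; the functional form and numbers are illustrative assumptions, not the thesis's actual model.

        import numpy as np

        def optimal_latency(payoff, reward_rate, latencies=np.linspace(0.1, 10, 200)):
            """payoff(t): expected reward if the action is initiated at latency t.
            reward_rate: background reward per unit time, i.e. the opportunity cost of waiting."""
            net = payoff(latencies) - reward_rate * latencies
            return latencies[np.argmax(net)]

        # Example: the payoff saturates with preparation time, so a higher background
        # reward rate (higher opportunity cost) pushes the optimal latency earlier.
        payoff = lambda t: 10 * (1 - np.exp(-t))
        print(optimal_latency(payoff, reward_rate=0.5))   # ~3.0: low cost, initiate later
        print(optimal_latency(payoff, reward_rate=5.0))   # ~0.7: high cost, initiate earlier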