
    Hardware-Efficient Scalable Reinforcement Learning Systems

    Reinforcement Learning (RL) is a machine learning discipline in which an agent learns by interacting with its environment. In this paradigm, the agent is required to perceive its state and take actions accordingly. Upon taking each action, a numerical reward is provided by the environment. The goal of the agent is thus to maximize the aggregate reward it receives over time. Over the past two decades, a large variety of algorithms have been proposed to select actions in order to explore the environment and gradually construct an effective strategy that maximizes the rewards. These RL techniques have been successfully applied to numerous real-world, complex applications, including board games and motor control tasks. Almost all RL algorithms involve the estimation of a value function, which indicates how good it is for the agent to be in a given state, in terms of the total expected reward in the long run. Alternatively, the value function may reflect the impact of taking a particular action in a given state. The most fundamental approach to constructing such a value function is to update a table that contains a value for each state (or each state-action pair). However, this approach is impractical for large-scale problems, in which the state and/or action spaces are large. To deal with such problems, it is necessary to exploit the generalization capabilities of non-linear function approximators, such as artificial neural networks. This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces. In particular, the work addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations. To address such intricate problems, novel solutions for both tabular and function-approximation-based RL frameworks are proposed.
A resource-efficient recurrent neural network algorithm is presented, which exploits adaptive step-size techniques to improve learning characteristics. Moreover, a consolidated actor-critic network is introduced, which omits the modeling redundancy found in typical actor-critic systems. Pivotal concerns are the scalability and speed of the learning algorithms, for which we devise architectures that map efficiently to hardware. As a result, a high degree of parallelism can be achieved. Simulation results on relevant benchmark problems clearly demonstrate the strong performance of the proposed solutions.
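The tabular approach described above can be illustrated with a minimal Q-learning loop. The chain-walk environment, reward scheme, and all hyperparameters below are illustrative stand-ins, not taken from the dissertation:

```python
import random

def tabular_q_learning(n_states=4, n_actions=2, episodes=500,
                       alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0
    moves left; reaching the last state ends the episode with reward 1."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]  # one value per state-action pair
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[s][act])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # one-step temporal-difference update of the table entry
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

With function approximation, the table lookup Q[s][a] would be replaced by a parameterized estimator, which is the regime this dissertation targets.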

    Interference-based dynamic pricing for WCDMA networks using neurodynamic programming

    Copyright © 2007 IEEE. We study the problem of optimal integrated dynamic pricing and radio resource management, in terms of resource allocation and call admission control, in a WCDMA network. In such an interference-limited network, one user's resource usage also degrades the utility of others. A new parameter, the noise rise factor, which indicates the amount of interference generated by a call, is suggested as a basis for setting prices that make users accountable for the congestion externality of their usage. The methods of dynamic programming (DP) are unsuitable for problems with large state spaces due to the associated "curse of dimensionality." To overcome this, we solve the problem using a simulation-based neurodynamic programming (NDP) method with an action-dependent approximation architecture. Our results show that the proposed optimal policy provides significant average-reward and congestion improvements over conventional policies that charge users based on their load factor. Siew-Lee Hew and Langford B. Whit
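An "action-dependent approximation architecture" is, in essence, a Q-function represented over state-action features rather than a table. A minimal semi-gradient sketch of one such update follows; the feature representation, step size, and discount are hypothetical, not the paper's actual setup:

```python
def linear_q_update(w, phi_sa, r, phi_next_best, alpha=0.05, gamma=0.95):
    """One semi-gradient Q-learning step for Q(s, a) = w . phi(s, a).

    phi_sa        -- features of the (state, action) just taken
    r             -- observed reward (e.g. revenue minus congestion cost)
    phi_next_best -- features of the greedy action in the next state
    """
    q = sum(wi * xi for wi, xi in zip(w, phi_sa))
    q_next = sum(wi * xi for wi, xi in zip(w, phi_next_best))
    delta = r + gamma * q_next - q        # temporal-difference error
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi_sa)]
```

In an admission-control setting, phi(s, a) might encode the current interference level and whether an arriving call is admitted; a network simulator supplies the transitions used for these updates.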

    Model Predictive Control Based on Deep Learning for Solar Parabolic-Trough Plants

    Nowadays, there is increasing interest in using renewable energy sources, including solar energy. Parabolic-trough plants are a type of solar thermal power plant in which solar radiation is reflected onto tubes by parabolic mirrors. Inside these tubes circulates a fluid, usually oil or water, which is heated to generate steam and turn a turbine to produce electricity. One of the most widely used methods to control these plants is model predictive control (MPC), which obtains the optimal control signals to send to the plant based on the use of a model of it. This method makes it possible to predict the future state of the system according to the chosen control strategy over a time horizon. MPC has the disadvantage of a significant computational cost associated with solving an optimization problem at each sampling time. This makes it challenging to implement in commercial and large plants, so one of the main challenges at present is to reduce these computation times, either technologically or by using suboptimal techniques that simplify the problem. This project proposes the use of neural networks that learn offline from the output provided by a predictive controller in order to approximate it. Different neural networks have been trained using a 30-day simulation dataset while varying the number of irradiance and temperature inputs. The results show that the neural networks can provide practically the same power as the MPC, with smoother variations of the output and very few constraint violations, even when the number of inputs is decreased. The work has been published in Renewable Energy, a first-quartile journal in Green & sustainable science & technology and Energy and fuels. Universidad de Sevilla. Máster en Ingeniería Industria
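The offline-imitation idea can be sketched in miniature: log (state, control) pairs from an expensive controller, then train a small network to reproduce its output. The toy controller, network width, and learning rate below are illustrative assumptions, not the project's actual MPC or architecture:

```python
import math
import random

def imitate_controller(controller, states, hidden=8, lr=0.05,
                       epochs=2000, seed=0):
    """Fit a one-hidden-layer tanh network to (state, control) pairs
    generated offline by `controller`, via plain SGD on squared error."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b2 = 0.0
    data = [(x, controller(x)) for x in states]   # the offline dataset
    for _ in range(epochs):
        for x, y in data:
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            yhat = sum(w2[j] * h[j] for j in range(hidden)) + b2
            e = yhat - y                            # prediction error
            b2 -= lr * e
            for j in range(hidden):
                gh = e * w2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
                w2[j] -= lr * e * h[j]
                w1[j] -= lr * gh * x
                b1[j] -= lr * gh
    def net(x):
        return sum(w2[j] * math.tanh(w1[j] * x + b1[j])
                   for j in range(hidden)) + b2
    return net
```

At run time the trained network replaces the optimizer call, turning each control step into a fixed, cheap forward pass.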

    Efficient Reinforcement Learning Using Recursive Least-Squares Methods

    Full text link
    The recursive least-squares (RLS) algorithm is one of the best-known algorithms in adaptive filtering, system identification, and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, and two new reinforcement learning algorithms using linear value function approximators are proposed and analyzed. The two algorithms are called RLS-TD(lambda) and Fast-AHC (Fast Adaptive Heuristic Critic), respectively. RLS-TD(lambda) can be viewed as the extension of RLS-TD(0) from lambda=0 to general lambda within the interval [0,1], so it is a multi-step temporal-difference (TD) learning algorithm using RLS methods. Convergence with probability one and the limit of convergence of RLS-TD(lambda) are proved for ergodic Markov chains. Compared to the existing LS-TD(lambda) algorithm, RLS-TD(lambda) has computational advantages and is more suitable for online learning. The effectiveness of RLS-TD(lambda) is analyzed and verified by learning-prediction experiments on Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD(lambda) algorithm in the critic network of the adaptive heuristic critic method. Unlike the conventional AHC algorithm, Fast-AHC makes use of RLS methods to improve the learning-prediction efficiency of the critic. Learning-control experiments on the cart-pole balancing and acrobot swing-up problems are conducted to compare the data efficiency of Fast-AHC with that of conventional AHC. The experimental results show that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD(lambda).
Furthermore, the experiments demonstrate that different initial values of the variance matrix in RLS-TD(lambda) are required to obtain better performance not only in learning prediction but also in learning control. The experimental results are analyzed based on existing theoretical work on the transient phase of forgetting-factor RLS methods.
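A compact sketch of the RLS-TD(lambda) recursion with linear features follows: an eligibility trace z, a gain K, and a Sherman-Morrison update of a matrix P whose initial scale p0 plays the role of the initial variance matrix discussed above. The episodic setup and all constants are illustrative, not taken from the paper:

```python
def rls_td(episodes, n, gamma=0.9, lam=0.5, p0=100.0):
    """RLS-TD(lambda): recursive least-squares estimate of linear value
    function weights theta. Each episode is a list of (phi, r, phi_next)
    triples; phi_next is the all-zero vector at a terminal state."""
    theta = [0.0] * n
    P = [[p0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for episode in episodes:
        z = [0.0] * n                                   # eligibility trace
        for phi, r, phi_next in episode:
            z = [gamma * lam * zi + fi for zi, fi in zip(z, phi)]
            dphi = [a - gamma * b for a, b in zip(phi, phi_next)]
            Pz = [sum(P[i][j] * z[j] for j in range(n)) for i in range(n)]
            denom = 1.0 + sum(d * v for d, v in zip(dphi, Pz))
            K = [v / denom for v in Pz]                  # RLS gain vector
            err = r - sum(d * t for d, t in zip(dphi, theta))
            theta = [t + k * err for t, k in zip(theta, K)]
            # Sherman-Morrison rank-one update of P
            dP = [sum(dphi[i] * P[i][j] for i in range(n)) for j in range(n)]
            P = [[P[i][j] - K[i] * dP[j] for j in range(n)] for i in range(n)]
    return theta
```

Because P tracks the inverse of the accumulated correlation matrix, each step costs O(n^2) instead of the O(n^3) a batch least-squares solve would require, which is the computational advantage over LS-TD(lambda) noted above.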

    A metaheuristic-based framework for index tracking with practical constraints

    Recently, numerous investors have shifted from active strategies to passive strategies because the passive approach affords stable returns over the long term. Index tracking is a popular passive strategy. Over the preceding years, most researchers have handled this problem via a two-step procedure. However, such a method is a suboptimal global-local optimization technique that frequently results in uncertainty and poor performance. This paper introduces a framework to address the comprehensive index tracking problem (ITP) with a joint approach based on metaheuristics. The purpose of this approach is to globally optimize the problem, where quality is measured by the tracking error and the excess return. Sparsity, weights, assets under management, transaction fees, the full-share restriction, and investment risk diversification are considered in this problem. However, these restrictions increase the complexity of the problem and make it a nondeterministic polynomial-time-hard problem. Metaheuristics form the principal process of the proposed framework, as they strike a desirable tradeoff between computational resource utilization and the quality of the obtained solution. This framework enables the constructed model to fit future data and facilitates the application of various metaheuristics. Competitive results are achieved by the proposed metaheuristic-based framework in the presented simulations.
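In miniature, the joint approach scores a candidate portfolio directly against the index under its constraints, and a metaheuristic searches the constrained space. Below, plain random search stands in for the metaheuristic, and only two of the constraints above (cardinality k and full investment) are modeled; all data and parameters are illustrative:

```python
import random

def tracking_error(weights, asset_returns, index_returns):
    """Root-mean-square gap between portfolio and index returns."""
    n = len(index_returns)
    se = 0.0
    for t in range(n):
        rp = sum(w * asset_returns[i][t] for i, w in enumerate(weights))
        se += (rp - index_returns[t]) ** 2
    return (se / n) ** 0.5

def random_search(asset_returns, index_returns, k, iters=2000, seed=0):
    """Toy metaheuristic: sample k-sparse, fully invested portfolios and
    keep the one with the lowest tracking error."""
    rng = random.Random(seed)
    n_assets = len(asset_returns)
    best_w, best_te = None, float("inf")
    for _ in range(iters):
        picks = rng.sample(range(n_assets), k)     # cardinality constraint
        raw = [rng.random() for _ in picks]
        total = sum(raw)
        w = [0.0] * n_assets
        for i, x in zip(picks, raw):
            w[i] = x / total                       # weights sum to 1
        te = tracking_error(w, asset_returns, index_returns)
        if te < best_te:
            best_w, best_te = w, te
    return best_w, best_te
```

A genetic algorithm or particle swarm would replace the independent sampling with recombination of promising candidates, but the evaluation of each candidate against the joint objective is the same.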