Meta-descent for Online, Continual Prediction
This paper investigates different vector step-size adaptation approaches for
non-stationary online, continual prediction problems. Vanilla stochastic
gradient descent can be considerably improved by scaling the update with a
vector of appropriately chosen step-sizes. Many methods, including AdaGrad,
RMSProp, and AMSGrad, keep statistics about the learning process to approximate
a second order update---a vector approximation of the inverse Hessian. Another
family of approaches uses meta-gradient descent to adapt the step-size
parameters to minimize prediction error. These meta-descent strategies are
promising for non-stationary problems, but have not been as extensively
explored as quasi-second order methods. We first derive a general, incremental
meta-descent algorithm, called AdaGain, designed to be applicable to a much
broader range of algorithms, including those with semi-gradient updates or even
those with accelerations, such as RMSProp. We provide an empirical comparison
of methods from both families. We conclude that methods from both families can
perform well, but in non-stationary prediction problems the meta-descent
methods exhibit advantages. Our method is particularly robust across several
prediction problems, and is competitive with the state-of-the-art method on a
large-scale, time-series prediction problem on real data from a mobile robot.
Comment: AAAI Conference on Artificial Intelligence 2019. v2: Correction to
Baird's counterexample. A bug in the code led to results being reported for
AMSGrad in this experiment, when they were actually results for Ada
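To make the idea of a vector of step-sizes concrete, here is a minimal sketch of online least-squares prediction with AdaGrad-style per-weight step-sizes, the quasi-second-order family named in the abstract. It is not AdaGain and not the paper's code. The function names, the synthetic data stream, and the constants `base_step` and `eps` are all assumptions chosen for illustration.

```python
import math
import random


def adagrad_online_regression(stream, dim, base_step=0.5, eps=1e-8):
    """Online linear prediction where each weight has its own step-size.

    Illustrative sketch only: the per-weight step-size is
    base_step / sqrt(sum of that weight's squared gradients), as in AdaGrad.
    """
    w = [0.0] * dim
    g2 = [0.0] * dim  # running sum of squared gradients, one entry per weight
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y  # gradient of 0.5 * err**2 with respect to pred
        for i in range(dim):
            grad = err * x[i]
            g2[i] += grad * grad
            w[i] -= base_step / (math.sqrt(g2[i]) + eps) * grad
    return w


def make_stream(n, true_w, seed=0):
    """Hypothetical noise-free data stream used only for this example."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        y = sum(wi * xi for wi, xi in zip(true_w, x))
        yield x, y


true_w = [2.0, -1.0, 0.5]
learned_w = adagrad_online_regression(make_stream(5000, true_w), dim=3)
```

A meta-descent method would instead treat the step-sizes themselves as parameters and adjust them by a gradient step on the prediction error. That variant is not shown here because the abstract does not give its update rule.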
Hardware-Efficient Scalable Reinforcement Learning Systems
Reinforcement Learning (RL) is a machine learning discipline in which an agent learns by interacting with its environment. In this paradigm, the agent is required to perceive its state and take actions accordingly. Upon taking each action, a numerical reward is provided by the environment. The goal of the agent is thus to maximize the aggregate rewards it receives over time. Over the past two decades, a large variety of algorithms have been proposed to select actions in order to explore the environment and gradually construct an effective strategy that maximizes the rewards. These RL techniques have been successfully applied to numerous real-world, complex applications including board games and motor control tasks.
Almost all RL algorithms involve the estimation of a value function, which indicates how good it is for the agent to be in a given state, in terms of the total expected reward in the long run. Alternatively, the value function may reflect on the impact of taking a particular action at a given state. The most fundamental approach for constructing such a value function consists of updating a table that contains a value for each state (or each state-action pair). However, this approach is impractical for large scale problems, in which the state and/or action spaces are large. In order to deal with such problems, it is necessary to exploit the generalization capabilities of non-linear function approximators, such as artificial neural networks.
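The table-based approach described above can be sketched with tabular TD(0), one common update rule for state values. This is an illustration under stated assumptions, not the dissertation's method: the five-state random-walk environment, the function name `td0_random_walk`, and the step-size and episode count are all hypothetical.

```python
import random


def td0_random_walk(num_states=5, episodes=2000, step_size=0.05,
                    discount=1.0, seed=0):
    """Estimate state values for a random walk with one table entry per state.

    Assumed environment: the agent starts in the middle state and moves left
    or right with equal probability. Leaving on the right gives reward 1,
    leaving on the left gives reward 0, and both end the episode.
    """
    rng = random.Random(seed)
    values = [0.0] * num_states  # the value table
    for _ in range(episodes):
        state = num_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= num_states:
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD(0): move the estimate toward reward + discounted next value
            td_error = reward + discount * next_value - values[state]
            values[state] += step_size * td_error
            if done:
                break
            state = next_state
    return values


values = td0_random_walk()
```

The table needs one entry per state, which is why this approach stops being practical once the state space grows large.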
This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces. In particular, the work addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations. In order to address such intricate problems, novel solutions for both tabular and function-approximation based RL frameworks are proposed. A resource-efficient recurrent neural network algorithm is presented, which exploits adaptive step-size techniques to improve learning characteristics. Moreover, a consolidated actor-critic network is introduced, which omits the modeling redundancy found in typical actor-critic systems. Pivotal concerns are the scalability and speed of the learning algorithms, for which we devise architectures that map efficiently to hardware. As a result, a high degree of parallelism can be achieved. Simulation results that correspond to relevant testbench problems clearly demonstrate the solid performance attributes of the proposed solutions.
Symmetric complex-valued RBF receiver for multiple-antenna aided wireless systems
A nonlinear beamforming assisted detector is proposed for multiple-antenna-aided wireless systems employing complex-valued quadrature phase shift-keying modulation. By exploiting the inherent symmetry of the optimal Bayesian detection solution, a novel complex-valued symmetric radial basis function (SRBF)-network-based detector is developed, which is capable of approaching the optimal Bayesian performance using channel-impaired training data. In the uplink case, adaptive nonlinear beamforming can be efficiently implemented by estimating the system’s channel matrix based on the least squares channel estimate. Adaptive implementation of nonlinear beamforming in the downlink case by contrast is much more challenging, and we adopt a cluster-variation-enhanced clustering algorithm to directly identify the SRBF center vectors required for realizing the optimal Bayesian detector. A simulation example is included to demonstrate the achievable performance improvement by the proposed adaptive nonlinear beamforming solution over the theoretical linear minimum bit error rate beamforming benchmark.
Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify those
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm
- …