Deep Reinforcement Learning for Stock Portfolio Optimization
Stock portfolio optimization is the process of continually redistributing
capital across a pool of stocks. In this paper, we formulate the problem so
that Reinforcement Learning can be applied to it properly. To keep the market
assumptions realistic, we also incorporate transaction costs and a risk factor
into the state. On top of that, we apply several state-of-the-art Deep
Reinforcement Learning algorithms for comparison. Since the action space is
continuous, the formulation was tested under a family of state-of-the-art
continuous policy gradient algorithms: Deep Deterministic Policy Gradient
(DDPG), Generalized Deterministic Policy Gradient (GDPG), and Proximal Policy
Optimization (PPO), where the former two perform much better than the last.
Next, we present an end-to-end solution for the task that uses Minimum
Variance Portfolio Theory for stock-subset selection and the Wavelet Transform
for extracting multi-frequency data patterns. We discuss observations and
hypotheses about the results, as well as possible future research directions.
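As a minimal sketch of what such a formulation implies, the one-step reward below combines the portfolio log-return with a proportional transaction cost and a simple risk penalty; the cost rate, risk weight, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def portfolio_reward(weights_new, weights_old, price_relatives,
                     cost_rate=0.0025, risk_penalty=0.1):
    """One-step log-return reward with a proportional transaction cost
    and a crude variance-based risk term (constants are illustrative)."""
    # Turnover incurred by redistributing capital across stocks.
    turnover = np.abs(weights_new - weights_old).sum()
    gross_return = np.dot(weights_new, price_relatives)
    net_return = gross_return * (1.0 - cost_rate * turnover)
    # Penalize volatile allocations as a simple risk factor.
    risk = risk_penalty * np.var(weights_new * price_relatives)
    return np.log(net_return) - risk
```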
Deterministic Policy Gradients With General State Transitions
We study a reinforcement learning setting where the state transition
function is a convex combination of a stochastic continuous function and a
deterministic function. Such a setting generalizes the widely studied
stochastic state transition setting, namely the setting of the deterministic
policy gradient (DPG).
We first give a simple example to illustrate that the deterministic policy
gradient may be infinite under deterministic state transitions, and introduce a
theoretical technique to prove the existence of the policy gradient in this
generalized setting. Using this technique, we prove that the deterministic
policy gradient indeed exists for a certain set of discount factors, and
further prove two conditions that guarantee its existence for all discount
factors. We then derive a closed form of the policy gradient whenever it exists.
Furthermore, to overcome the challenge of the high sample complexity of DPG in
this setting, we propose the Generalized Deterministic Policy Gradient (GDPG)
algorithm. The main innovation of the algorithm is a new method of applying
model-based techniques to a model-free algorithm, deep deterministic policy
gradient (DDPG). GDPG optimizes the long-term rewards of the model-based
augmented MDP subject to a constraint that the long-term rewards of the
original MDP do not exceed those of the augmented MDP.
We finally conduct extensive experiments comparing GDPG with state-of-the-art
methods and the direct model-based extension of DDPG on several standard
continuous control benchmarks. Results demonstrate that GDPG substantially
outperforms DDPG, the model-based extension of DDPG, and other baselines in
terms of both convergence and long-term rewards in most environments.
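For reference, here is a minimal sketch of the model-free deterministic policy gradient step that GDPG builds on (the DDPG actor update), assuming PyTorch actor and critic modules; this is not the paper's GDPG update itself.

```python
import torch

def ddpg_actor_loss(actor, critic, states):
    """Deterministic policy gradient step (DDPG-style): ascend
    grad_a Q(s, a) through the deterministic action a = mu_theta(s)."""
    actions = actor(states)             # a = mu_theta(s), continuous
    q_values = critic(states, actions)  # Q_w(s, mu_theta(s))
    # Minimizing -Q implements gradient ascent on long-term reward.
    return -q_values.mean()
```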
Gradient Estimation Using Stochastic Computation Graphs
In a variety of problems originating in supervised, unsupervised, and
reinforcement learning, the loss function is defined by an expectation over a
collection of random variables, which might be part of a probabilistic model or
the external world. Estimating the gradient of this loss function, using
samples, lies at the core of gradient-based learning algorithms for these
problems. We introduce the formalism of stochastic computation
graphs---directed acyclic graphs that include both deterministic functions and
conditional probability distributions---and describe how to easily and
automatically derive an unbiased estimator of the loss function's gradient. The
resulting algorithm for computing the gradient estimator is a simple
modification of the standard backpropagation algorithm. The generic scheme we
propose unifies estimators derived in a variety of prior work, along with
variance-reduction techniques therein. It could assist researchers in
developing intricate models involving a combination of stochastic and
deterministic operations, enabling, for example, attention, memory, and control
actions.
Comment: Advances in Neural Information Processing Systems 28 (NIPS 2015).
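As a small sketch of the two classic estimators that the SCG formalism unifies, the snippet below computes both the score-function and the pathwise gradient for a single Gaussian node feeding a deterministic cost (PyTorch; variable names are illustrative).

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_std = torch.tensor(0.0, requires_grad=True)

def cost(x):                       # downstream deterministic cost f(x)
    return (x - 2.0) ** 2

# Score-function (REINFORCE) estimator: uses grad log p(x) * f(x),
# applicable even when sampling itself is not differentiable.
dist = torch.distributions.Normal(mu, log_std.exp())
x = dist.sample()
surrogate = dist.log_prob(x) * cost(x)
grad_sf = torch.autograd.grad(surrogate, mu)[0]

# Pathwise (reparameterization) estimator: writing x = mu + sigma * eps
# lets the gradient flow through the deterministic part of the graph.
eps = torch.randn(())
x_rep = mu + log_std.exp() * eps
grad_pw = torch.autograd.grad(cost(x_rep), mu)[0]
```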
A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics
Several multiagent reinforcement learning (MARL) algorithms have been
proposed to optimize agents' decisions. Due to the complexity of the problem,
the majority of previously developed MARL algorithms assumed that agents either
had some knowledge of the underlying game (such as its Nash equilibria) and/or
observed other agents' actions and the rewards they received.
We introduce a new MARL algorithm called the Weighted Policy Learner (WPL),
which allows agents to reach a Nash Equilibrium (NE) in benchmark
2-player-2-action games with minimum knowledge. Using WPL, the only feedback an
agent needs is its own local reward (the agent does not observe other agents'
actions or rewards). Furthermore, WPL does not assume that agents know the
underlying game or the corresponding Nash Equilibrium a priori. We
experimentally show that our algorithm converges in benchmark
two-player-two-action games. We also show that our algorithm converges in the
challenging Shapley's game, where previous MARL algorithms failed to converge
without knowing the underlying game or the NE. Furthermore, we show that WPL
outperforms the state-of-the-art algorithms in a more realistic setting of 100
agents interacting and learning concurrently.
An important aspect of understanding the behavior of a MARL algorithm is
analyzing the dynamics of the algorithm: how the policies of multiple learning
agents evolve over time as agents interact with one another. Such an analysis
not only verifies whether agents using a given MARL algorithm will eventually
converge, but also reveals the behavior of the MARL algorithm prior to
convergence. We analyze our algorithm in two-player-two-action games and show
that symbolically proving WPL's convergence is difficult because of the
non-linear nature of WPL's dynamics, unlike previous MARL algorithms that had
either linear or piece-wise-linear dynamics. Instead, we numerically solve the
differential equations of WPL's dynamics and compare the solution to the
dynamics of previous MARL algorithms.
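A hedged sketch of the WPL update rule as it is usually stated in the literature: each action's gradient is weighted by the probability mass still available to move in that direction, which is precisely what makes the dynamics non-linear. The projection step shown is a crude stand-in.

```python
import numpy as np

def wpl_update(policy, reward_gradient, lr=0.01):
    """Weighted Policy Learner step for one agent (illustrative).
    Gradients pushing probability down are scaled by pi(a); gradients
    pushing it up are scaled by 1 - pi(a)."""
    weights = np.where(reward_gradient < 0, policy, 1.0 - policy)
    policy = policy + lr * reward_gradient * weights
    # Project back onto the probability simplex (simple clip/renormalize).
    policy = np.clip(policy, 1e-6, 1.0)
    return policy / policy.sum()
```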
Learning robust control for LQR systems with multiplicative noise via policy gradient
The linear quadratic regulator (LQR) problem has reemerged as an important
theoretical benchmark for reinforcement learning-based control of complex
dynamical systems with continuous state and action spaces. In contrast with
nearly all recent work in this area, we consider multiplicative noise models,
which are increasingly relevant because they explicitly incorporate inherent
uncertainty and variation in the system dynamics and thereby improve robustness
properties of the controller. Robustness is a critical and poorly understood
issue in reinforcement learning; existing methods which do not account for
uncertainty can converge to fragile policies or fail to converge at all.
Additionally, intentional injection of multiplicative noise into learning
algorithms can enhance robustness of policies, as observed in ad hoc work on
domain randomization. Although policy gradient algorithms require optimization
of a non-convex cost function, we show that the multiplicative noise LQR cost
has a special property called gradient domination, which is exploited to prove
global convergence of policy gradient algorithms to the globally optimal
control policy with polynomial dependence on problem parameters. Results are
provided in both the model-known and model-unknown settings, where samples of
system trajectories are used to estimate policy gradients.
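An illustrative sketch of the model-unknown setting described here: a finite-horizon LQR cost under multiplicative noise, and a two-point zeroth-order policy-gradient estimate from sampled rollouts. All names, the horizon, and the smoothing constants are assumptions, not the paper's implementation.

```python
import numpy as np

def lqr_cost(K, A, B, Abar, Bbar, Q, R, x0, horizon=50, noise_std=0.1):
    """Finite-horizon cost of u = -K x under multiplicative noise:
    x_{t+1} = (A + d_t*Abar) x + (B + g_t*Bbar) u  (illustrative)."""
    x, cost = x0.copy(), 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        d, g = noise_std * np.random.randn(2)
        x = (A + d * Abar) @ x + (B + g * Bbar) @ u
    return cost

def zeroth_order_grad(K, cost_fn, smoothing=0.05, samples=32):
    """Model-free two-point gradient estimate of the LQR cost in K."""
    grad = np.zeros_like(K)
    for _ in range(samples):
        U = np.random.randn(*K.shape)
        U /= np.linalg.norm(U)
        delta = cost_fn(K + smoothing * U) - cost_fn(K - smoothing * U)
        grad += delta / (2 * smoothing) * U
    return grad / samples
```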
Backprop-Q: Generalized Backpropagation for Stochastic Computation Graphs
In real-world scenarios, it is appealing to learn a model carrying out
stochastic operations internally, known as stochastic computation graphs
(SCGs), rather than learning a deterministic mapping. However, standard
backpropagation is not applicable to SCGs. We attempt to address this issue
from the angle of cost propagation, with local surrogate costs, called
Q-functions, constructed and learned for each stochastic node in an SCG. Then,
the SCG can be trained based on these surrogate costs using standard
backpropagation. We propose the entire framework as a solution to generalize
backpropagation for SCGs, which resembles an actor-critic architecture but is
based on a graph. For broad applicability, we study a variety of SCG structures
from one cost to multiple costs. We utilize recent advances in reinforcement
learning (RL) and variational Bayes (VB), such as off-policy critic learning
and unbiased-and-low-variance gradient estimation, and review them in the
context of SCGs. The generalized backpropagation extends transported learning
signals beyond gradients between stochastic nodes while preserving the benefit
of backpropagating gradients through deterministic nodes. Experimental
suggestions and concerns are listed to help design and test any specific model
using this framework.
Comment: NeurIPS 2018 Deep Reinforcement Learning Workshop.
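A minimal sketch of the core idea, assuming PyTorch: a stochastic (categorical) node paired with a local surrogate cost Q_phi, so that upstream parameters receive a score-function gradient driven by the learned Q rather than by the full downstream cost. The architecture and names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticNode(nn.Module):
    """Stochastic node with a local surrogate cost Q_phi (illustrative)."""
    def __init__(self, num_choices):
        super().__init__()
        self.q_net = nn.Linear(num_choices, 1)   # learned surrogate cost

    def surrogate_loss(self, logits):
        dist = torch.distributions.Categorical(logits=logits)
        z = dist.sample()
        one_hot = F.one_hot(z, logits.shape[-1]).float()
        q = self.q_net(one_hot).squeeze(-1)
        # REINFORCE-style surrogate: gradients reach upstream parameters
        # through log_prob; Q_phi itself would be regressed toward the
        # true downstream cost in a separate step (not shown).
        return (dist.log_prob(z) * q.detach()).mean(), z
```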
Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning
Deep reinforcement learning (DRL) on Markov decision processes (MDPs) with
continuous action spaces is often approached by directly training parametric
policies along the direction of estimated policy gradients (PGs). Previous
research revealed that the performance of these PG algorithms depends heavily
on the bias-variance tradeoffs involved in estimating and using PGs. A notable
approach towards balancing this tradeoff is to merge both on-policy and
off-policy gradient estimations. However, existing PG merging methods can be
sample-inefficient and are not suitable for training deterministic policies
directly. To address these issues, this paper introduces elite PGs and
strengthens their variance reduction effect by adopting elitism and policy
consolidation techniques to regularize policy training based on policy
behavioral knowledge extracted from elite trajectories. Meanwhile, we propose a
two-step method to merge elite PGs and conventional PGs as a new extension of
the conventional interpolation merging method. At both the theoretical and
experimental levels, we show that both two-step merging and interpolation
merging can induce varied bias-variance tradeoffs during policy training. They
enable us to effectively use elite PGs and mitigate their performance impact on
trained policies. Our experiments also show that two-step merging can
outperform interpolation merging and several state-of-the-art algorithms on six
benchmark control tasks.
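For orientation, here is a schematic of the conventional interpolation merging the paper extends, plus a hedged guess at the shape of the two-step variant; the mixing coefficients and the exact placement of elite PGs are assumptions for illustration only.

```python
def interpolation_merge(g_on, g_off, beta=0.5):
    """Conventional interpolation merging of on-policy and off-policy
    policy-gradient estimates; beta trades bias against variance."""
    return beta * g_on + (1.0 - beta) * g_off

def two_step_merge(g_elite, g_on, g_off, beta1=0.5, beta2=0.5):
    """Illustrative two-step extension: first fold the low-variance
    elite PG into the on-policy estimate, then interpolate with the
    off-policy estimate (coefficients are assumptions)."""
    g_intermediate = beta1 * g_elite + (1.0 - beta1) * g_on
    return beta2 * g_intermediate + (1.0 - beta2) * g_off
```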
Simultaneous Perturbation Methods for Adaptive Labor Staffing in Service Systems
Service systems are labor intensive due to the large variation in the tasks
required to address service requests from multiple customers. Aligning the
staffing levels to the forecasted workloads adaptively in such systems is
nontrivial because of a large number of parameters and operational variations
leading to a huge search space. A challenging problem here is to optimize the
staffing while maintaining the system in steady state and compliant with
aggregate service level agreement (SLA) constraints. Further, because these
parameters change on a weekly basis, the optimization should not take longer
than a few hours. We formulate this problem as a constrained Markov cost
process parameterized by the (discrete) staffing levels. We propose novel
simultaneous perturbation stochastic approximation (SPSA) based SASOC (Staff
Allocation using Stochastic Optimization with Constraints) algorithms for
solving the above problem. The algorithms include both first order as well as
second order methods and incorporate SPSA based gradient estimates in the
primal, with dual ascent for the Lagrange multipliers. Both of the algorithms
we propose are online, incremental, and easy to implement. Further, they involve
a certain generalized smooth projection operator, which is essential to project
the continuous-valued worker parameter tuned by SASOC algorithms onto the
discrete set. We validated our algorithms on five real-life service systems and
compared them with the state-of-the-art optimization toolkit OptQuest. Being 25
times faster than OptQuest, our algorithms are particularly suitable for
adaptive labor staffing. We also observe that our algorithms guarantee
convergence and find better solutions than OptQuest in many cases.
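A rough sketch of one first-order SASOC-style iteration, assuming a user-supplied Lagrangian evaluator: an SPSA two-measurement gradient estimate in the primal, dual ascent for the multiplier, and a crude rounding stand-in for the generalized smooth projection onto discrete staffing levels.

```python
import numpy as np

def spsa_step(theta, lagrangian, lam, lr=0.1, delta=0.5):
    """One primal update (illustrative): SPSA needs only two function
    evaluations to estimate the full gradient, however many parameters."""
    perturb = np.random.choice([-1.0, 1.0], size=theta.shape)
    l_plus = lagrangian(theta + delta * perturb, lam)
    l_minus = lagrangian(theta - delta * perturb, lam)
    grad = (l_plus - l_minus) / (2 * delta * perturb)
    theta = theta - lr * grad
    # Stand-in for the generalized smooth projection onto the
    # discrete set of non-negative staffing levels.
    theta_discrete = np.maximum(np.rint(theta), 0)
    return theta, theta_discrete

def dual_ascent(lam, constraint_violation, lr_dual=0.01):
    """Lagrange multiplier update; must remain non-negative."""
    return max(0.0, lam + lr_dual * constraint_violation)
```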
OffCon: What is state of the art anyway?
Two popular approaches to model-free continuous control tasks are SAC and
TD3. At first glance these approaches seem rather different; SAC aims to solve
the entropy-augmented MDP by minimising the KL-divergence between a stochastic
proposal policy and a hypothetical energy-based soft Q-function policy, whereas
TD3 is derived from DPG, which uses a deterministic policy to perform policy
gradient ascent along the value function. In reality, both approaches are
remarkably similar, and belong to a family of approaches we call `Off-Policy
Continuous Generalized Policy Iteration'. This illuminates their similar
performance in most continuous control benchmarks, and indeed when
hyperparameters are matched, their performance can be statistically
indistinguishable. To further remove any difference due to implementation, we
provide OffCon (Off-Policy Continuous Control: Consolidated), a code base
featuring state-of-the-art versions of both algorithms.
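The family resemblance is easiest to see in the critic target. The sketch below (PyTorch, with illustrative signatures) reduces both algorithms to clipped double-Q bootstrapping: alpha = 0 roughly recovers TD3's target (modulo its target-policy smoothing), while alpha > 0 adds SAC's entropy bonus.

```python
import torch

def critic_target(q1, q2, policy, next_states, rewards, dones,
                  gamma=0.99, alpha=0.0):
    """Shared structure of the TD3 and SAC critic targets (illustrative).
    `policy` is assumed to return (action, log_prob) for a batch."""
    with torch.no_grad():
        next_actions, log_prob = policy(next_states)
        q_next = torch.min(q1(next_states, next_actions),
                           q2(next_states, next_actions))
        q_next = q_next - alpha * log_prob   # entropy-augmented value
        return rewards + gamma * (1.0 - dones) * q_next
```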
Adaptive Algorithms for Coverage Control and Space Partitioning in Mobile Robotic Networks
This paper considers deployment problems where a mobile robotic network must
optimize its configuration in a distributed way in order to minimize a
steady-state cost function that depends on the spatial distribution of certain
probabilistic events of interest. Moreover, it is assumed that the event
location distribution is a priori unknown, and can only be progressively
inferred from the observation of the actual event occurrences. Three classes of
problems are discussed in detail: coverage control problems, spatial
partitioning problems, and dynamic vehicle routing problems. In each case,
distributed stochastic gradient algorithms optimizing the performance objective
are presented. The stochastic gradient view simplifies and generalizes
previously proposed solutions, and is applicable to new complex scenarios, such
as adaptive coverage involving heterogeneous agents. Remarkably, these
algorithms often take the form of simple distributed rules that could be
implemented on resource-limited platforms.
Comment: 16 pages, 4 figures. Long version of a manuscript to appear in the
Transactions on Automatic Control.
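A minimal sketch of the kind of distributed stochastic-gradient rule described here, for the coverage case: on each observed event, the robot whose Voronoi cell contains it moves toward the event, so the configuration descends the expected coverage cost without the event distribution ever being known explicitly (names and step size are illustrative).

```python
import numpy as np

def coverage_update(positions, event, lr=0.05):
    """Distributed stochastic-gradient step for coverage control
    (illustrative): only the robot owning the event's Voronoi cell
    moves, a simple local rule suited to resource-limited platforms."""
    dists = np.linalg.norm(positions - event, axis=1)
    owner = np.argmin(dists)            # cell containing the event
    positions[owner] += lr * (event - positions[owner])
    return positions
```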