13 research outputs found
Model-based Reinforcement Learning with Parametrized Physical Models and Optimism-Driven Exploration
In this paper, we present a robotic model-based reinforcement learning method
that combines ideas from model identification and model predictive control. We
use a feature-based representation of the dynamics that allows the dynamics
model to be fitted with a simple least squares procedure, and the features are
identified from a high-level specification of the robot's morphology,
consisting of the number and connectivity structure of its links. Model
predictive control is then used to choose the actions under an optimistic model
of the dynamics, which produces an efficient and goal-directed exploration
strategy. We present real-time experimental results on standard benchmark problems involving the pendulum, cartpole, and double pendulum systems. Experiments indicate that our method is able to learn a range of benchmark tasks substantially faster than the previous best methods. To evaluate our approach on a realistic robotic control task, we also demonstrate real-time control of a simulated 7-degree-of-freedom arm.
Comment: 8 pages
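The fitting step the abstract describes reduces to ordinary least squares once a feature representation of the dynamics is fixed. Below is a minimal sketch of that idea in Python/NumPy; the pendulum feature map, the cost function, and the random-shooting planner are illustrative assumptions rather than the paper's implementation, and the planner here uses the mean fitted model instead of the paper's optimistic one.

    import numpy as np

    def features(x, u):
        # Hypothetical feature map for a pendulum state x = [theta, theta_dot]
        # and torque u; the paper instead derives features from a high-level
        # specification of the robot's morphology.
        return np.array([np.sin(x[0]), np.cos(x[0]), x[1], u[0], 1.0])

    def fit_dynamics(X, U, X_next):
        # Fit x' ~ W @ features(x, u) with ordinary least squares.
        Phi = np.stack([features(x, u) for x, u in zip(X, U)])  # (N, d)
        W, *_ = np.linalg.lstsq(Phi, np.asarray(X_next), rcond=None)
        return W.T                                              # (dim_x, d)

    def plan_action(W, x0, horizon=10, n_candidates=256, rng=None):
        # Random-shooting MPC under the fitted model: sample action
        # sequences, roll them out, and return the first action of the
        # lowest-cost sequence (a stand-in for the paper's optimistic MPC).
        rng = rng or np.random.default_rng(0)
        best_u, best_cost = None, np.inf
        for _ in range(n_candidates):
            us = rng.uniform(-2.0, 2.0, size=(horizon, 1))
            x, cost = x0, 0.0
            for u in us:
                x = W @ features(x, u)
                cost += x[0]**2 + 0.1 * x[1]**2 + 0.01 * u[0]**2
            if cost < best_cost:
                best_u, best_cost = us[0], cost
        return best_u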
Transferring a Reinforcement Learning Approach from a Simulation Model to a Mobile Robot
Reinforcement learning, as one method of machine learning, shows promising results when integrated into various robotics algorithms. However, achieving optimal robot behavior requires a significant amount of time and resources. Virtual experiments make it possible to substantially speed up training and improve algorithm performance. We applied a reinforcement learning approach to a localization and mapping algorithm running on a mobile robot. The algorithm was trained in the Gazebo simulation environment and then transferred to a real robot. The publication demonstrates the feasibility of using simulation to train algorithms deployed on mobile robots.
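The workflow this abstract describes, training in simulation and then deploying the learned policy on hardware, follows a common sim-to-real pattern. The sketch below is a hypothetical illustration of that handoff with a Gym-style interface; SimEnv, RealRobotEnv, and the policy's act/update methods are assumed names, and the paper itself trains in a Gazebo simulation of its specific robot.

    import pickle

    def train(env, policy, episodes=1000):
        # Phase 1: learn entirely in simulation, where rollouts are cheap
        # and failures are harmless.
        for _ in range(episodes):
            obs, done = env.reset(), False
            while not done:
                action = policy.act(obs)
                next_obs, reward, done = env.step(action)
                policy.update(obs, action, reward, next_obs)
                obs = next_obs
        return policy

    def deploy(env, path="policy.pkl"):
        # Phase 2: load the frozen policy and run it on the real robot,
        # with no further learning on hardware.
        with open(path, "rb") as f:
            policy = pickle.load(f)
        obs, done = env.reset(), False
        while not done:
            obs, _, done = env.step(policy.act(obs))

    # Usage (both environments are assumed interfaces):
    # policy = train(SimEnv(), SomePolicy())
    # with open("policy.pkl", "wb") as f:
    #     pickle.dump(policy, f)
    # deploy(RealRobotEnv())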
Deterministic Value-Policy Gradients
Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients as a means to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the
model-based deterministic value gradient estimators with the model-free
deterministic policy gradient estimator, we propose the deterministic
value-policy gradient (DVPG) algorithm. We finally conduct extensive
experiments comparing DVPG with state-of-the-art methods on several standard
continuous control benchmarks. Results demonstrate that DVPG substantially
outperforms other baselines.
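The k-step deterministic value gradient at the heart of DVG can be illustrated compactly: roll the deterministic policy through a learned, differentiable dynamics model for k steps, bootstrap with a value network, and backpropagate. The PyTorch framing, function names, and single-rollout estimate below are assumptions for illustration, not the authors' implementation; DVPG additionally blends this model-based estimator with the model-free DDPG gradient.

    import torch

    def k_step_return(policy, model, value_fn, reward_fn, s0, k, gamma=0.99):
        # Differentiable k-step return under the learned model. Backprop
        # through this scalar yields the analytic value gradient w.r.t. the
        # policy parameters. Larger k relies less on the bootstrapped value
        # but accumulates more model bias and variance through the rollout,
        # which is the trade-off the abstract describes.
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(k):
            a = policy(s)            # deterministic action
            ret = ret + discount * reward_fn(s, a)
            s = model(s, a)          # learned, differentiable dynamics
            discount *= gamma
        return ret + discount * value_fn(s)

    # Usage sketch (names and shapes are assumptions):
    # loss = -k_step_return(pi, dyn_model, V, reward, s0, k=3).mean()
    # loss.backward()  # analytic deterministic value gradient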