13 research outputs found

    Model-based Reinforcement Learning with Parametrized Physical Models and Optimism-Driven Exploration

    In this paper, we present a robotic model-based reinforcement learning method that combines ideas from model identification and model predictive control. We use a feature-based representation of the dynamics that allows the dynamics model to be fitted with a simple least squares procedure, and the features are identified from a high-level specification of the robot's morphology, consisting of the number and connectivity structure of its links. Model predictive control is then used to choose the actions under an optimistic model of the dynamics, which produces an efficient and goal-directed exploration strategy. We present real-time experimental results on standard benchmark problems involving the pendulum, cartpole, and double pendulum systems. Experiments indicate that our method is able to learn a range of benchmark tasks substantially faster than the previous best methods. To evaluate our approach on a realistic robotic control task, we also demonstrate real-time control of a simulated 7 degree of freedom arm.
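    As a rough illustration of the recipe the abstract describes, the sketch below fits a linear dynamics model on hand-chosen features with ordinary least squares and then picks actions by random-shooting model predictive control under that model. The feature map, horizon, cost function, and candidate sampling are illustrative assumptions, not the paper's exact formulation (which also includes the optimism term and morphology-derived features).

```python
# Minimal sketch: least-squares dynamics fitting + random-shooting MPC.
# Feature map, horizon, and sampling scheme are assumptions for illustration.
import numpy as np

def phi(state, action):
    # Hypothetical feature map for a pendulum-like system.
    return np.concatenate([state, np.sin(state), np.cos(state), action, [1.0]])

def fit_dynamics(features, next_states):
    # Least-squares fit of next_state ~ W @ phi(state, action).
    # features: (N, d_feat), next_states: (N, d_state).
    W, *_ = np.linalg.lstsq(features, next_states, rcond=None)
    return W.T  # shape (d_state, d_feat)

def mpc_action(W, state, cost_fn, action_dim, horizon=15, n_candidates=256, rng=None):
    # Random-shooting MPC: sample action sequences, roll them out under the
    # learned model, and return the first action of the cheapest sequence.
    rng = np.random.default_rng() if rng is None else rng
    plans = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    costs = np.zeros(n_candidates)
    for i, plan in enumerate(plans):
        s = state.copy()
        for a in plan:
            s = W @ phi(s, a)          # predicted next state
            costs[i] += cost_fn(s, a)
    return plans[np.argmin(costs), 0]
```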

    ΠŸΠ΅Ρ€Π΅Π½ΠΎΡ ΠΏΠΎΠ΄Ρ…ΠΎΠ΄Π° машинного обучСния с ΠΏΠΎΠ΄ΠΊΡ€Π΅ΠΏΠ»Π΅Π½ΠΈΠ΅ΠΌ с симуляционной ΠΌΠΎΠ΄Π΅Π»ΠΈ Π½Π° мобильного Ρ€ΠΎΠ±ΠΎΡ‚Π°

    Reinforcement learning, as one form of machine learning, shows promising results when integrated into various robotics algorithms. However, achieving optimal robot behavior requires a significant amount of time and resources. By using virtual experiments, it is possible to substantially speed up training and improve the performance of these algorithms. We integrated a reinforcement learning approach into a localization and mapping algorithm used on a mobile robot. The algorithm was trained in the Gazebo simulation environment and transferred to a real robot. The publication demonstrates the feasibility of using simulation to train algorithms deployed on mobile robots.

    Deterministic Value-Policy Gradients

    Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
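    The sketch below illustrates the central idea summarized in the abstract: the policy objective mixes a model-free DDPG-style term (critic only) with a model-based term that backpropagates through k rollout steps of a learned differentiable dynamics model before bootstrapping with the critic. The network interfaces, reward function, mixing weight, and k are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch of mixing model-based value gradients with a
# model-free deterministic policy gradient term. Networks, reward_fn,
# the mixing weight, and k are hypothetical stand-ins.
import torch

def k_step_value(policy, model, critic, reward_fn, state, k, gamma=0.99):
    # Roll the learned model forward k steps under the current policy,
    # accumulating rewards, then bootstrap with the critic.
    value, discount, s = 0.0, 1.0, state
    for _ in range(k):
        a = policy(s)
        value = value + discount * reward_fn(s, a)
        s = model(s, a)          # differentiable learned dynamics
        discount *= gamma
    a = policy(s)
    return value + discount * critic(s, a)

def dvpg_policy_loss(policy, model, critic, reward_fn, states, k=3, mix=0.5):
    # Model-free term: DDPG-style objective Q(s, pi(s)).
    model_free = critic(states, policy(states)).mean()
    # Model-based term: k-step value gradient through the learned model.
    model_based = k_step_value(policy, model, critic, reward_fn, states, k).mean()
    # Maximize the mixture, so minimize its negative.
    return -(mix * model_based + (1.0 - mix) * model_free)
```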