Policy Optimization of Finite-Horizon Kalman Filter with Unknown Noise Covariance
This paper studies learning the Kalman gain by a policy optimization method.
Firstly, we reformulate the finite-horizon Kalman filter as a policy
optimization problem over the dual system. Secondly, we establish the global
linear convergence of the exact gradient descent method in the setting of
known parameters. Thirdly, we propose a gradient estimation and stochastic
gradient descent method to solve the policy optimization problem, and further
establish the global linear convergence and sample complexity of stochastic
gradient descent in the setting of unknown noise covariance matrices and known
model parameters.
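To make the setup concrete, here is a minimal sketch of the gradient-descent view of filtering described in this abstract: a filter gain is treated as the policy and updated by zeroth-order stochastic gradient descent on the empirical observation-prediction error. The system matrices, the static (rather than time-varying, finite-horizon) gain, the surrogate loss, and the perturbation-based gradient estimator are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Sketch: learn a Kalman-style gain L by stochastic gradient descent on the
# empirical one-step observation-prediction error. A, C, T are illustrative;
# the paper treats the finite-horizon, time-varying gain more generally.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])    # state transition (assumed known)
C = np.array([[1.0, 0.0]])                # observation matrix (assumed known)
Q, R = 0.1 * np.eye(2), 0.05 * np.eye(1)  # noise covariances (unknown to the learner)
T = 20                                    # horizon

def rollout():
    """Simulate one trajectory of states and observations."""
    x, xs, ys = rng.multivariate_normal(np.zeros(2), np.eye(2)), [], []
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
        y = C @ x + rng.multivariate_normal(np.zeros(1), R)
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)

def pred_loss(L, ys):
    """Mean squared observation-prediction error of the filter with gain L."""
    xhat, loss = np.zeros(2), 0.0
    for y in ys:
        xhat = A @ xhat + L @ (y - C @ (A @ xhat))  # predict, then correct
        loss += np.sum((y - C @ xhat) ** 2)
    return loss / len(ys)

L = np.zeros((2, 1))  # the filter gain plays the role of the "policy"
lr, eps = 0.02, 1e-4
for step in range(2000):
    _, ys = rollout()
    # Zeroth-order gradient estimate via a random two-point perturbation
    # (one possible estimator; the paper's estimator may differ).
    U = rng.standard_normal(L.shape)
    g = (pred_loss(L + eps * U, ys) - pred_loss(L - eps * U, ys)) / (2 * eps) * U
    # Crude projection to keep the closed-loop filter stable (illustrative).
    L = np.clip(L - lr * g, -2.0, 2.0)
print("learned gain:\n", L)
```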
A Policy-Guided Imitation Approach for Offline Reinforcement Learning
Offline reinforcement learning (RL) methods can generally be categorized into
two types: RL-based and Imitation-based. RL-based methods could in principle
enjoy out-of-distribution generalization but suffer from erroneous off-policy
evaluation. Imitation-based methods avoid off-policy evaluation but are too
conservative to surpass the dataset. In this study, we propose an alternative
approach, inheriting the training stability of imitation-style methods while
still allowing logical out-of-distribution generalization. We decompose the
conventional reward-maximizing policy in offline RL into a guide-policy and an
execute-policy. During training, the guide-policy and execute-policy are
learned using only data from the dataset, in a supervised and decoupled
manner. During evaluation, the guide-policy guides the execute-policy by
telling it where to go so that the reward can be maximized, serving as the
\textit{Prophet}.
By doing so, our algorithm allows \textit{state-compositionality} from the
dataset, rather than \textit{action-compositionality} conducted in prior
imitation-style methods. We dub this new approach Policy-guided Offline RL
(\texttt{POR}). \texttt{POR} demonstrates state-of-the-art performance on
D4RL, a standard benchmark for offline RL. We also highlight the benefits of
\texttt{POR} in terms of improving with supplementary suboptimal data and
easily adapting to new tasks by only changing the guide-policy.
Comment: Oral @ NeurIPS 2022, code at https://github.com/ryanxhr/PO
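The following sketch illustrates the guide/execute decomposition described in the abstract: both policies are trained by supervised regression on dataset transitions only, and at evaluation the guide proposes a target state that the execute-policy acts to reach. The network architectures, the advantage-weighted regression for the guide-policy, and the deterministic execute-policy are illustrative assumptions; the exact POR objectives are in the paper and the repository linked above.

```python
import torch
import torch.nn as nn

# Sketch of the guide/execute decomposition. Sizes and the reward/advantage
# weighting are illustrative; POR itself weights the guide-policy update with
# a learned value function (see the paper and repo).
S_DIM, A_DIM, H = 4, 2, 256

guide = nn.Sequential(nn.Linear(S_DIM, H), nn.ReLU(), nn.Linear(H, S_DIM))
execute = nn.Sequential(nn.Linear(2 * S_DIM, H), nn.ReLU(), nn.Linear(H, A_DIM))
opt = torch.optim.Adam([*guide.parameters(), *execute.parameters()], lr=3e-4)

def train_step(s, a, s_next, advantage):
    """One decoupled, supervised update on a batch of dataset transitions."""
    # Guide-policy: regress toward observed next states, up-weighting
    # transitions with high estimated advantage -- a stand-in for POR's
    # value-weighted objective.
    w = torch.exp(advantage).clamp(max=100.0)
    guide_loss = (w * (guide(s) - s_next).pow(2).sum(-1)).mean()
    # Execute-policy: behavior cloning conditioned on the reached state,
    # i.e. learn a(s, s') from data alone (no off-policy evaluation).
    exec_loss = (execute(torch.cat([s, s_next], -1)) - a).pow(2).mean()
    opt.zero_grad()
    (guide_loss + exec_loss).backward()
    opt.step()

def act(s):
    """Evaluation: the guide proposes a target state (the "Prophet") and
    the execute-policy outputs the action to reach it."""
    with torch.no_grad():
        s_target = guide(s)
        return execute(torch.cat([s, s_target], -1))

# Usage with random stand-in data (no real D4RL dataset is loaded here):
s, a, s2 = torch.randn(32, S_DIM), torch.randn(32, A_DIM), torch.randn(32, S_DIM)
adv = torch.randn(32)
train_step(s, a, s2, adv)
print(act(torch.randn(1, S_DIM)))
```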