Learning to Control in Metric Space with Optimal Regret
We study online reinforcement learning for finite-horizon deterministic
control systems with {\it arbitrary} state and action spaces. Suppose that the
transition dynamics and reward function are unknown, but the state-action
space is endowed with a metric that characterizes the proximity between
different states and actions. We provide a surprisingly simple upper-confidence
reinforcement learning algorithm that uses a function approximation oracle to
estimate optimistic Q functions from experiences. We show that the regret of
the algorithm after $K$ episodes is $O\big((DLK)^{\frac{d}{d+1}}\,H\big)$, where $D$ is the diameter of the state-action space, $L$ is a smoothness parameter, $H$ is the horizon, and $d$ is the doubling dimension of the state-action space with respect to the given metric. We also establish a near-matching regret lower bound. The proposed method can be adapted to work for more structured transition systems, including the finite-state case and the case where value functions are linear combinations of features, where the method also achieves the optimal regret.
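The optimistic estimation step has a simple geometric reading: under the metric, any observed Q value, inflated by $L$ times the distance to the query point, is a valid upper bound, so the tightest such bound (capped at the horizon) is an optimistic estimate. The sketch below illustrates this with a brute-force scan over past observations and a Euclidean metric; the function names, the metric, and the scan-based oracle are illustrative assumptions, not the paper's oracle or analysis.

```python
import math

def optimistic_q(query, observations, lipschitz_l, horizon_h):
    """Optimistic Q estimate at a state-action point `query`.

    `observations` holds ((state, action), q_estimate) pairs collected so far,
    `lipschitz_l` plays the role of the smoothness parameter L, and
    `horizon_h` is the trivial upper bound on any Q value.  The estimate is
    the tightest Lipschitz upper bound implied by the data:
        Q_hat(x) = min( H, min_i [ q_i + L * dist(x, x_i) ] ).
    Far from all data the bonus is large, so the estimate stays optimistic.
    """
    def dist(x, y):
        # Assumed metric: Euclidean distance on concatenated state-action vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    best = horizon_h
    for point, q_value in observations:
        best = min(best, q_value + lipschitz_l * dist(query, point))
    return best


# Tiny usage example on a 1-D state, 1-D action space.
data = [((0.0, 0.0), 1.0), ((0.5, 0.2), 2.0)]
print(optimistic_q((0.1, 0.0), data, lipschitz_l=3.0, horizon_h=5.0))  # ~1.3
```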
Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement
We explore the methodology and theory of reward-directed generation via
conditional diffusion models. Directed generation aims to generate samples with
desired properties as measured by a reward function, which has broad
applications in generative AI, reinforcement learning, and computational
biology. We consider the common learning scenario where the data set consists
of unlabeled data along with a smaller set of data with noisy reward labels.
Our approach leverages a learned reward function on the smaller data set as a
pseudolabeler. From a theoretical standpoint, we show that this directed
generator can effectively learn and sample from the reward-conditioned data
distribution. Additionally, our model is capable of recovering the latent
subspace representation of data. Moreover, we establish that the model
generates a new population that moves closer to a user-specified target reward
value, where the optimality gap aligns with the off-policy bandit regret in the
feature subspace. The improvement in rewards obtained is influenced by the
interplay between the strength of the reward signal, the distribution shift,
and the cost of off-support extrapolation. We provide empirical results to
validate our theory and highlight the relationship between the strength of
extrapolation and the quality of generated samples.
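The semi-supervised pseudo-labeling step lends itself to a short sketch: fit a reward model on the small labeled subset, score the unlabeled pool, and condition generation on a target reward. Below is a minimal numpy version assuming a linear reward model fit by ridge regression on latent features; the dimensions, the linear model, and the quantile-based target are illustrative assumptions, and the conditional diffusion model itself is left schematic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: a large unlabeled set and a small noisily labeled set,
# both living (approximately) in a low-dimensional latent feature space.
x_unlabeled = rng.normal(size=(5000, 16))
x_labeled = rng.normal(size=(200, 16))
y_noisy = x_labeled @ rng.normal(size=16) + 0.1 * rng.normal(size=200)

# Step 1: learn a reward model on the small labeled set (ridge regression here
# as a stand-in for whatever reward network is actually used).
lam = 1.0
theta = np.linalg.solve(
    x_labeled.T @ x_labeled + lam * np.eye(16), x_labeled.T @ y_noisy
)

# Step 2: pseudo-label the unlabeled data with the learned reward model.
pseudo_rewards = x_unlabeled @ theta

# Step 3 (schematic): the pairs (x, pseudo_reward) would then train a
# conditional diffusion model p(x | reward), and generation conditions on a
# user-specified target reward, e.g. a high quantile of the pseudo-labels.
target_reward = np.quantile(pseudo_rewards, 0.95)
print("conditioning target:", round(float(target_reward), 3))
```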
Topics in Low-Rank Markov Decision Process: Applications in Policy Gradient, Model Estimation and Markov Games
In this thesis, we study topics on Markov Decision Processes (MDPs) with a low-rank structure. We begin with the definition of a low-rank Markov Decision Process and discuss the related applications in the following chapters. In Chapter 2, we consider the off-policy estimation problem of the policy gradient. We propose an estimator based on Fitted Q Iteration which can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. We provide a tight finite-sample upper bound on the estimation error, given that the MDP satisfies the low-rank assumption. Empirically, we evaluate the performance of the estimator on both policy gradient estimation and policy optimization. Under various metrics, our results show that the estimator significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
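To make the Fitted Q Iteration backbone concrete, here is a rough sketch of off-policy evaluation of a fixed target policy with a linear value function class; the feature map, the discounted formulation, and the toy data generator are assumptions, and the chapter's actual policy-gradient estimator and finite-sample analysis are more involved.

```python
import numpy as np

def fitted_q_evaluation(dataset, policy, features, n_actions, gamma=0.99, iters=50):
    """Fitted Q Iteration for off-policy evaluation of a fixed policy.

    dataset: list of (s, a, r, s_next) transitions from any behavior policy.
    policy(s) -> probability vector over the n_actions discrete actions.
    features(s, a) -> 1-D feature vector (an assumed linear function class).
    Returns the weights w of the linear estimate Q(s, a) = features(s, a) @ w.
    """
    dim = features(dataset[0][0], dataset[0][1]).shape[0]
    w = np.zeros(dim)
    phi = np.array([features(s, a) for s, a, _, _ in dataset])
    reg = 1e-6 * np.eye(dim)
    for _ in range(iters):
        # Regression target: r + gamma * E_{a' ~ pi}[ Q(s', a') ] under current w.
        targets = []
        for s, a, r, s_next in dataset:
            q_next = np.array([features(s_next, b) @ w for b in range(n_actions)])
            targets.append(r + gamma * policy(s_next) @ q_next)
        w = np.linalg.solve(phi.T @ phi + reg, phi.T @ np.array(targets))
    return w


# Toy usage: scalar states, 2 actions, features = a per-action [s, 1] block.
def feat(s, a):
    v = np.zeros(4)
    v[2 * a: 2 * a + 2] = [s, 1.0]
    return v

rng = np.random.default_rng(0)
data, s = [], 0.0
for _ in range(500):
    a = int(rng.integers(2))                 # behavior policy: uniform
    r = 1.0 if a == 1 else 0.0
    s_next = float(np.clip(s + (0.1 if a == 1 else -0.1) + 0.01 * rng.normal(), -1, 1))
    data.append((s, a, r, s_next))
    s = s_next

uniform_pi = lambda s: np.array([0.5, 0.5])  # target policy to evaluate
w = fitted_q_evaluation(data, uniform_pi, feat, n_actions=2, gamma=0.9)
print("Q(0, a=1) estimate:", round(float(feat(0.0, 1) @ w), 3))
```

A policy-gradient estimate would then combine the learned Q values with the score function of the parameterized policy; the chapter's estimator and its finite-sample guarantee are built on the same Fitted Q Iteration idea but differ in the details.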
In Chapter 3 and Chapter 4, we study the estimation problem of low-rank MDP models. A tensor-based formulation is proposed to capture the low-rank information of the model. We develop a tensor-rank-constrained estimator that recovers the model from the collected data, and provide statistical guarantees on the estimation error. The tensor decomposition of the transition model provides useful information for the reduction of the state and action spaces. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.
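One way to picture the tensor formulation: estimate the transition tensor P[s, a, s'] from counts and read off per-mode factors with a truncated higher-order SVD, whose mode-0 and mode-1 factors act as low-dimensional state and action embeddings. The numpy sketch below assumes a tabular MDP and uses a plain HOSVD heuristic rather than the rank-constrained estimator with statistical guarantees developed in these chapters; all names are illustrative.

```python
import numpy as np

def empirical_transition_tensor(transitions, n_states, n_actions):
    """Count-based estimate of P[s, a, s'] from (s, a, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def hosvd_factors(tensor, ranks):
    """Truncated higher-order SVD: one factor matrix per mode."""
    factors = []
    for mode, r in enumerate(ranks):
        # Mode-m unfolding, then keep the top-r left singular vectors.
        unfolding = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(u[:, :r])
    return factors

# Toy usage: random transitions on a small MDP, rank-2 factors per mode.
rng = np.random.default_rng(0)
samples = [(rng.integers(6), rng.integers(3), rng.integers(6)) for _ in range(2000)]
p_hat = empirical_transition_tensor(samples, n_states=6, n_actions=3)
state_emb, action_emb, next_state_emb = hosvd_factors(p_hat, ranks=(2, 2, 2))
print(state_emb.shape, action_emb.shape)   # (6, 2) (3, 2)
```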
In Chapter 5, we study the representation learning problem of Markov Games, which is a natural extension of MDPs to the multi-player setting. We present a model-based and a model-free approach to construct an effective representation from the collected data, which is further used to learn an equilibrium policy. A theoretical guarantee is provided, which shows the algorithm is able to find a near-optimal policy with polynomial interactions with the environment. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates function approximation.