4 research outputs found

    Learning to Control in Metric Space with Optimal Regret

    We study online reinforcement learning for finite-horizon deterministic control systems with {\it arbitrary} state and action spaces. Suppose that the transition dynamics and reward function are unknown, but the state and action spaces are endowed with a metric that characterizes the proximity between different states and actions. We provide a surprisingly simple upper-confidence reinforcement learning algorithm that uses a function approximation oracle to estimate optimistic Q functions from experiences. We show that the regret of the algorithm after $K$ episodes is $O(HL(KH)^{\frac{d-1}{d}})$, where $H$ is the horizon, $L$ is a smoothness parameter, and $d$ is the doubling dimension of the state-action space with respect to the given metric. We also establish a near-matching regret lower bound. The proposed method can be adapted to work for more structured transition systems, including the finite-state case and the case where value functions are linear combinations of features, where the method also achieves the optimal regret.
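    As a rough illustration of the kind of algorithm described above, the sketch below runs optimistic Q-learning over an epsilon-net of the state-action metric space: visits are aggregated into metric balls, Q-values are initialized optimistically at the horizon, and a Lipschitz slack accounts for discretization error. The class name, the count-based step size, and the simplified bonus are illustrative assumptions, not the paper's oracle-based procedure.

        # Minimal sketch: optimistic Q-learning over an epsilon-net of the
        # state-action metric space. Names and the simplified bonus are
        # illustrative assumptions, not the paper's exact algorithm.
        from collections import defaultdict

        class MetricUCBQ:
            def __init__(self, horizon, lipschitz, eps, metric):
                self.H, self.L, self.eps, self.metric = horizon, lipschitz, eps, metric
                self.centers = []                      # epsilon-net of state-action pairs
                self.counts = defaultdict(int)         # visit counts per (step, ball)
                self.q = defaultdict(lambda: horizon)  # optimistic initialization at H

            def _ball(self, sa):
                # Snap a state-action pair to the nearest net center, adding one if needed.
                for i, c in enumerate(self.centers):
                    if self.metric(sa, c) <= self.eps:
                        return i
                self.centers.append(sa)
                return len(self.centers) - 1

            def q_value(self, h, sa):
                # Optimistic estimate: ball-center value plus a Lipschitz slack.
                i = self._ball(sa)
                return min(self.H, self.q[(h, i)] + self.L * self.metric(sa, self.centers[i]))

            def update(self, h, sa, reward, next_value):
                i = self._ball(sa)
                self.counts[(h, i)] += 1
                lr = (self.H + 1) / (self.H + self.counts[(h, i)])  # optimistic step size
                target = reward + next_value + self.L * self.eps    # simplified discretization bonus
                self.q[(h, i)] = (1 - lr) * self.q[(h, i)] + lr * target

    Shrinking the net scale eps as the number of episodes K grows (roughly as (KH)^{-1/d}) trades discretization error against the number of balls, which is roughly how a regret bound of the stated form arises.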

    Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement

    We explore the methodology and theory of reward-directed generation via conditional diffusion models. Directed generation aims to generate samples with desired properties as measured by a reward function, which has broad applications in generative AI, reinforcement learning, and computational biology. We consider the common learning scenario where the data set consists of unlabeled data along with a smaller set of data with noisy reward labels. Our approach leverages a learned reward function on the smaller data set as a pseudolabeler. From a theoretical standpoint, we show that this directed generator can effectively learn and sample from the reward-conditioned data distribution. Additionally, our model is capable of recovering the latent subspace representation of data. Moreover, we establish that the model generates a new population that moves closer to a user-specified target reward value, where the optimality gap aligns with the off-policy bandit regret in the feature subspace. The improvement in rewards obtained is influenced by the interplay between the strength of the reward signal, the distribution shift, and the cost of off-support extrapolation. We provide empirical results to validate our theory and highlight the relationship between the strength of extrapolation and the quality of generated samples.
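    The pipeline described above (fit a reward model on the small labeled set, pseudolabel the unlabeled data, then train a reward-conditioned generator) might look roughly like the sketch below, in which a single-noise-level denoiser stands in for the full conditional diffusion model. All module sizes and function names are assumptions for illustration, not the paper's architecture.

        # Minimal sketch of the pseudolabeling pipeline for reward-directed
        # conditional generation; a one-noise-level denoiser stands in for the
        # full conditional diffusion model, and all names/sizes are illustrative.
        import torch
        import torch.nn as nn

        def fit_reward_model(x_labeled, y_noisy, epochs=200):
            # Regress the noisy reward labels on the small labeled subset.
            model = nn.Sequential(nn.Linear(x_labeled.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
            opt = torch.optim.Adam(model.parameters(), lr=1e-3)
            for _ in range(epochs):
                opt.zero_grad()
                loss = ((model(x_labeled).squeeze(-1) - y_noisy) ** 2).mean()
                loss.backward()
                opt.step()
            return model

        def train_conditional_denoiser(x_unlabeled, reward_model, epochs=200, sigma=0.5):
            # Pseudolabel the unlabeled data, then fit a denoiser conditioned on
            # the pseudo-reward (a stand-in for conditional diffusion training).
            with torch.no_grad():
                pseudo_r = reward_model(x_unlabeled)          # pseudolabels from the learned reward
            d = x_unlabeled.shape[1]
            denoiser = nn.Sequential(nn.Linear(d + 1, 128), nn.ReLU(), nn.Linear(128, d))
            opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
            for _ in range(epochs):
                noise = sigma * torch.randn_like(x_unlabeled)
                inp = torch.cat([x_unlabeled + noise, pseudo_r], dim=1)
                loss = ((denoiser(inp) - noise) ** 2).mean()  # predict the injected noise
                opt.zero_grad()
                loss.backward()
                opt.step()
            return denoiser

    At generation time one would condition on a target reward somewhat above the data's typical value; pushing the target far beyond the data support is exactly the off-support extrapolation cost the abstract refers to.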

    Topics in Low-Rank Markov Decision Process: Applications in Policy Gradient, Model Estimation and Markov Games

    In this thesis, we study topics in Markov Decision Processes (MDPs) with a low-rank structure. We begin with the definition of a low-rank Markov Decision Process and discuss the related applications in the following chapters. In Chapter 2, we consider the off-policy estimation problem of the policy gradient. We propose an estimator based on Fitted Q Iteration which can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. We provide a tight finite-sample upper bound on the estimation error, given that the MDP satisfies the low-rank assumption. Empirically, we evaluate the performance of the estimator on both policy gradient estimation and policy optimization. Under various metrics, our results show that the estimator significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques. In Chapter 3 and Chapter 4, we study the estimation problem of low-rank MDP models. A tensor-based formulation is proposed to capture the low-rank information of the model. We develop a tensor-rank-constrained estimator that recovers the model from the collected data, and provide statistical guarantees on the estimation error. The tensor decomposition of the transition model provides useful information for the reduction of the state and action spaces. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation. In Chapter 5, we study the representation learning problem of Markov Games, which is a natural extension of MDPs to the multi-player setting. We present a model-based and a model-free approach to construct an effective representation from the collected data, which is further used to learn an equilibrium policy. A theoretical guarantee is provided, which shows that the algorithm is able to find a near-optimal policy within a polynomial number of interactions with the environment. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates function approximation.
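    As a concrete illustration of the Fitted Q Iteration estimator discussed in Chapter 2, the sketch below runs a generic FQI loop with an off-the-shelf regressor to evaluate a target policy from logged transitions. The regressor choice and all names are assumptions rather than the thesis's exact Bellman-complete construction, and the policy-gradient step built on top of the Q estimate is omitted.

        # Minimal Fitted Q Iteration sketch for off-policy evaluation of a target
        # policy from logged transitions. The regressor and names are illustrative,
        # not the thesis's exact estimator.
        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        def fitted_q_iteration(transitions, target_policy, horizon):
            # transitions: list of (state, action, reward, next_state) tuples,
            # with states as 1-D arrays and actions as scalars.
            s = np.array([t[0] for t in transitions])
            a = np.array([t[1] for t in transitions]).reshape(-1, 1)
            r = np.array([t[2] for t in transitions])
            s_next = np.array([t[3] for t in transitions])

            q_model = None
            for _ in range(horizon):
                if q_model is None:
                    target = r  # last-step backup is just the reward
                else:
                    # Back up through the action the target policy takes at the next state.
                    a_next = np.array([target_policy(x) for x in s_next]).reshape(-1, 1)
                    target = r + q_model.predict(np.hstack([s_next, a_next]))
                q_model = GradientBoostingRegressor().fit(np.hstack([s, a]), target)
            return q_model  # approximate Q for the target policy at the initial step

    A policy-gradient estimate would then combine this Q estimate with the score function of the chosen policy parameterization, which is the part the thesis analyzes under the low-rank assumption.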