8 research outputs found

    Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate

    Reinforcement learning provides a mathematical framework for learning-based control, whose success largely depends on the amount of data it can utilize. The efficient utilization of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well. However, the existing literature often neglects the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method that reuses historical trajectories via importance sampling. We show that the bias of the proposed gradient estimator is asymptotically negligible, that the resulting algorithm is convergent, and that reusing past trajectories helps improve the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.

    Bayesian Stochastic Gradient Descent for Stochastic Optimization with Streaming Input Data

    We consider stochastic optimization under distributional uncertainty, where the unknown distributional parameter is estimated from streaming data that arrive sequentially over time. Moreover, the data may depend on the decision in effect at the time they are generated. For both decision-independent and decision-dependent uncertainties, we propose an approach that jointly estimates the distributional parameter via the Bayesian posterior distribution and updates the decision by applying stochastic gradient descent to the Bayesian average of the objective function. Our approach converges asymptotically over time and achieves the convergence rates of classical SGD in the decision-independent case. We demonstrate the empirical performance of our approach on both synthetic test problems and a classical newsvendor problem.

    Episodic Bayesian Optimal Control with Unknown Randomness Distributions

    Stochastic optimal control with unknown randomness distributions has been studied for a long time, encompassing robust control, distributionally robust control, and adaptive control. We propose a new episodic Bayesian approach that incorporates Bayesian learning with optimal control. In each episode, the approach learns the randomness distribution with a Bayesian posterior and subsequently solves the corresponding Bayesian average estimate of the true problem. The resulting policy is exercised during the episode, while additional data/observations of the randomness are collected to update the Bayesian posterior for the next episode. We show that the resulting episodic value functions and policies converge almost surely to their optimal counterparts of the true problem if the parametrized model of the randomness distribution is correctly specified. We further show that the asymptotic convergence rate of the episodic value functions is of the order O(N^{-1/2}). We develop an efficient computational method based on stochastic dual dynamic programming for a class of problems that have convex value functions. Our numerical results on a classical inventory control problem verify the theoretical convergence results and demonstrate the effectiveness of the proposed computational method.

    A dual-branch model with inter- and intra-branch contrastive loss for long-tailed recognition

    Real-world data often exhibits a long-tailed distribution, in which head classes occupy most of the data, while tail classes have only very few samples. Models trained on long-tailed datasets adapt poorly to tail classes, and their decision boundaries are ambiguous. Therefore, in this paper, we propose a simple yet effective model, named Dual-Branch Long-Tailed Recognition (DB-LTR), which includes an imbalanced learning branch and a Contrastive Learning Branch (CoLB). The imbalanced learning branch, which consists of a shared backbone and a linear classifier, leverages common imbalanced learning approaches to tackle the data imbalance issue. In CoLB, we learn a prototype for each tail class, and calculate an inter-branch contrastive loss, an intra-branch contrastive loss and a metric loss. CoLB can improve the capability of the model in adapting to tail classes and assist the imbalanced learning branch to learn a well-represented feature space and discriminative decision boundary. Extensive experiments on three long-tailed benchmark datasets, i.e., CIFAR100-LT, ImageNet-LT and Places-LT, show that our DB-LTR is competitive and superior to the comparative methods. Comment: Published at Neural Network

    Bayesian Risk Markov Decision Processes

    We consider finite-horizon Markov Decision Processes where distributional parameters, such as transition probabilities, are unknown and estimated from data. The popular distributionally robust approach to addressing the parameter uncertainty can sometimes be overly conservative. In this paper, we propose a new formulation, Bayesian risk Markov decision process (BR-MDP), to address parameter uncertainty in MDPs, where a risk functional is applied in nested form to the expected total cost with respect to the Bayesian posterior distributions of the unknown parameters. The proposed formulation provides more flexible risk attitudes towards parameter uncertainty and takes into account the availability of data in future time stages. To solve the proposed formulation with the conditional value-at-risk (CVaR) risk functional, we propose an efficient approximation algorithm by deriving an analytical approximation of the value function and utilizing the convexity of CVaR. We demonstrate the empirical performance of the BR-MDP formulation and the proposed algorithms on a gambler's betting problem and an inventory control problem.