
    Feature Search in the Grassmannian in Online Reinforcement Learning

    We consider the problem of finding the best features for value function approximation in reinforcement learning and develop an online algorithm to optimize the mean square Bellman error objective. For any given feature value, our algorithm performs gradient search in the parameter space via a residual gradient scheme and, on a slower timescale, also performs gradient search in the Grassmann manifold of features. We present a proof of convergence of our algorithm. We show empirical results using our algorithm as well as a similar algorithm that uses temporal difference learning in place of the residual gradient scheme for the faster-timescale updates.
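    To make the two-timescale scheme concrete, the following is a minimal sketch, not the authors' implementation: the value-function weights are updated with a residual-gradient step on the fast timescale, while an orthonormal feature basis Phi (a point on the Grassmann manifold) is updated on the slower timescale with a projected gradient step and a QR retraction. The state descriptor x, the dimensions D and k, and the step sizes alpha and beta are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): two-timescale feature search on the
# Grassmannian for linear value-function approximation.  The raw state
# descriptor x(s) in R^D is projected onto k features via an orthonormal
# basis Phi, i.e. phi(s) = Phi^T x(s); all names here are assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, k, gamma = 10, 3, 0.95
Phi = np.linalg.qr(rng.standard_normal((D, k)))[0]   # orthonormal feature basis
w = np.zeros(k)                                      # value-function weights

def value(x, Phi, w):
    return (Phi.T @ x) @ w

def step(x, r, x_next, alpha=0.05, beta=0.005):
    """One two-timescale update: fast residual-gradient step in w,
    slow Riemannian gradient step in Phi with a QR retraction."""
    global w, Phi
    delta = r + gamma * value(x_next, Phi, w) - value(x, Phi, w)  # Bellman residual
    # Residual-gradient update: gradient of the squared residual w.r.t. w
    # involves both phi(x) and gamma * phi(x_next).
    grad_w = delta * (gamma * (Phi.T @ x_next) - (Phi.T @ x))
    w -= alpha * grad_w
    # Euclidean gradient of the squared residual w.r.t. Phi.
    grad_Phi = delta * (gamma * np.outer(x_next, w) - np.outer(x, w))
    # Project onto the tangent space of the Grassmannian, then retract via QR.
    tangent = grad_Phi - Phi @ (Phi.T @ grad_Phi)
    Phi = np.linalg.qr(Phi - beta * tangent)[0]

# Toy usage: random transitions from a synthetic chain, purely illustrative.
for _ in range(1000):
    x = rng.standard_normal(D)
    x_next = 0.9 * x + 0.1 * rng.standard_normal(D)
    r = float(x[0])
    step(x, r, x_next)
```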

    Actor-Critic Algorithms with Online Feature Adaptation

    We develop two new online actor-critic control algorithms with adaptive feature tuning for Markov decision processes (MDPs). One of our algorithms is proposed for the long-run average cost objective, while the other works for discounted cost MDPs. Our actor-critic architecture incorporates parameterization of both the policy and the value function. A gradient search in the policy parameters is performed to improve the performance of the actor. Computing this gradient, however, requires an estimate of the value function of the policy corresponding to the current actor parameter. The value function, in turn, is approximated using linear function approximation and obtained from the critic. The error in approximating the value function, however, results in suboptimal policies. In our article, we also update the features by performing gradient descent on the Grassmannian of features to minimize a mean square Bellman error objective, in order to find the best features. The aim is to obtain a good approximation of the value function and thereby ensure convergence of the actor to locally optimal policies. To estimate the gradient of the objective in the case of the average cost criterion, we utilize the policy gradient theorem, while in the case of the discounted cost objective, we utilize the simultaneous perturbation stochastic approximation (SPSA) scheme. We prove that our actor-critic algorithms converge to locally optimal policies. Experiments on two different settings show performance improvements resulting from our feature adaptation scheme.
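    The discounted-cost variant is described as estimating the gradient via simultaneous perturbation stochastic approximation (SPSA). The snippet below is a minimal, generic SPSA gradient estimator under assumed names: J_hat stands in for a noisy performance estimate of the perturbed parameters (e.g. a Monte Carlo return), and c is the perturbation size; it is not the papers' actor-critic code.

```python
# Minimal sketch of a two-sided SPSA gradient estimate (assumed names, not
# the papers' implementation): perturb all coordinates at once with a
# Rademacher vector and form a single finite-difference estimate of grad J.
import numpy as np

rng = np.random.default_rng(1)

def spsa_gradient(theta, J_hat, c=0.1):
    """Two evaluations of J_hat give an estimate of the full gradient."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)     # Rademacher perturbation
    j_plus = J_hat(theta + c * delta)
    j_minus = J_hat(theta - c * delta)
    return (j_plus - j_minus) / (2.0 * c * delta)          # componentwise division

# Toy usage on a noisy quadratic surrogate standing in for the policy value.
def J_hat(theta):
    return -np.sum((theta - 2.0) ** 2) + 0.01 * rng.standard_normal()

theta = np.zeros(4)
for t in range(500):
    a_t = 0.5 / (t + 10)                                   # diminishing step size
    theta += a_t * spsa_gradient(theta, J_hat)             # gradient ascent on J
print(theta)                                               # approaches [2, 2, 2, 2]
```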

    Multi-agent Reinforcement Learning for Traffic Signal Control

    Optimal control of traffic lights at junctions, or traffic signal control (TSC), is essential for reducing the average delay experienced by road users amidst the rapid increase in vehicle usage. In this paper, we formulate the TSC problem as a discounted cost Markov decision process (MDP) and apply multi-agent reinforcement learning (MARL) algorithms to obtain dynamic TSC policies. We model each traffic signal junction as an independent agent. An agent decides the signal duration of its phases in a round-robin (RR) manner using multi-agent Q-learning with either ε-greedy or UCB [3] based exploration strategies. It updates its Q-factors based on the cost feedback signal received from its neighbouring agents. This feedback signal can be easily constructed and is shown to be effective in minimizing the average delay of the vehicles in the network. We show through simulations over VISSIM that our algorithms perform significantly better than both the standard fixed signal timing (FST) algorithm and the saturation balancing (SAT) algorithm [15] over two real road networks.
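    As an illustration of the per-junction learning scheme, here is a minimal sketch under assumed names (JunctionAgent, DURATIONS, neighbour_cost): each agent keeps a Q-table over (state, current phase, green-time choice), explores with ε-greedy, and performs a cost-minimizing Q-learning update using a feedback signal built from its own and its neighbours' queue lengths, advancing its phase in round-robin order. It is not the paper's implementation and omits the UCB alternative and the VISSIM interface.

```python
# Minimal sketch (assumed structure, not the paper's code): one Q-learning
# agent per junction choosing the green time for its current phase, with
# epsilon-greedy exploration and a neighbour-based cost signal.
import random
from collections import defaultdict

DURATIONS = [10, 20, 30]          # candidate green times (seconds), placeholder
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1

class JunctionAgent:
    def __init__(self, n_phases):
        self.q = defaultdict(float)   # Q[(state, phase, duration_index)]
        self.phase = 0
        self.n_phases = n_phases

    def choose(self, state):
        """Epsilon-greedy choice of the green duration for the current phase."""
        if random.random() < EPS:
            return random.randrange(len(DURATIONS))
        return min(range(len(DURATIONS)),
                   key=lambda a: self.q[(state, self.phase, a)])

    def update(self, state, action, cost, next_state):
        """Cost-minimizing Q-learning update, then advance the phase round-robin."""
        next_phase = (self.phase + 1) % self.n_phases
        best_next = min(self.q[(next_state, next_phase, a)]
                        for a in range(len(DURATIONS)))
        key = (state, self.phase, action)
        self.q[key] += ALPHA * (cost + GAMMA * best_next - self.q[key])
        self.phase = next_phase

def neighbour_cost(own_queue, neighbour_queues):
    """Assumed form of the shared feedback: own queue plus weighted neighbours'."""
    return own_queue + 0.5 * sum(neighbour_queues)

# Toy usage with one agent and synthetic queue lengths.
agent = JunctionAgent(n_phases=4)
state = 0
for _ in range(100):
    a = agent.choose(state)
    cost = neighbour_cost(random.randint(0, 20), [random.randint(0, 20)])
    next_state = random.randint(0, 4)
    agent.update(state, a, cost, next_state)
    state = next_state
```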