1,503 research outputs found

    Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

    We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$ and action space $\mathcal{A}$, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an $\varepsilon$-optimal policy with an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}\log\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\varepsilon}$ samples for any $\varepsilon \in (0, \frac{1}{1-\gamma}]$. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically impossible).
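    To make the model-based (plug-in) approach concrete, here is a minimal numpy sketch, not the paper's exact perturbed algorithm: it draws the same number of samples per state-action pair from a generative model, forms an empirical transition kernel, and runs value iteration on the empirical MDP. The interface `sample_next_state` and all sizes are hypothetical placeholders.

```python
import numpy as np

def plug_in_planning(sample_next_state, num_states, num_actions, reward,
                     gamma=0.99, samples_per_pair=1000, num_iters=1000):
    """Plan on an empirical MDP estimated from a generative model.

    sample_next_state(s, a, n) is a hypothetical generative-model interface
    returning an array of n i.i.d. next-state indices for the pair (s, a);
    reward is a (num_states, num_actions) array of rewards in [0, 1].
    """
    # Estimate the transition kernel P_hat(s' | s, a) by empirical frequencies.
    P_hat = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            next_states = sample_next_state(s, a, samples_per_pair)
            P_hat[s, a] = np.bincount(next_states, minlength=num_states) / samples_per_pair

    # Plain value iteration on the empirical MDP (the paper additionally
    # perturbs the rewards; that step is omitted in this sketch).
    V = np.zeros(num_states)
    for _ in range(num_iters):
        Q = reward + gamma * (P_hat @ V)   # (S, A) Bellman backup
        V = Q.max(axis=1)
    policy = Q.argmax(axis=1)              # greedy policy w.r.t. the final Q
    return policy, V
```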

    Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

    The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $\mathcal{S}$ or $\mathcal{A}$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an $\varepsilon$-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp. $\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when $K$ is relatively small, and hence the title of this paper.
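    As a point of reference, the feature-based structure referred to above is commonly written as follows; this is a sketch of the standard linear-kernel parameterization, with generic feature maps $\phi_k$ and $\psi_k$ as placeholders rather than the paper's exact notation.

```latex
% Linear parameterization of the transition kernel with K known
% state-action features \phi_k and K unknown next-state factors \psi_k:
\[
  P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s, a)\, \psi_k(s'),
  \qquad (s, a) \in \mathcal{S} \times \mathcal{A},\ s' \in \mathcal{S}.
\]
```

    Under such a structure, learning the $K$ unknown factors replaces learning an $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|$ transition table, which is why the sample complexity above depends on $K$ rather than on $|\mathcal{S}|\times|\mathcal{A}|$.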

    IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models

    This paper provides a unified account of two schools of thinking in information retrieval modelling: generative retrieval, which focuses on predicting relevant documents given a query, and discriminative retrieval, which focuses on predicting relevancy given a query-document pair. We propose a game-theoretic minimax game to iteratively optimise both models. On one hand, the discriminative model, aiming to mine signals from labelled and unlabelled data, provides guidance for training the generative model towards fitting the underlying relevance distribution over documents given the query. On the other hand, the generative model, acting as an attacker to the current discriminative model, generates difficult examples for the discriminative model in an adversarial way by minimising its discrimination objective. Through the competition between these two models, we show that the unified framework takes advantage of both schools of thinking: (i) the generative model learns to fit the relevance distribution over documents via the signals from the discriminative model, and (ii) the discriminative model is able to exploit the unlabelled data selected by the generative model to achieve a better estimation for document ranking. Our experimental results demonstrate significant performance gains of as much as 23.96% on Precision@5 and 15.50% on MAP over strong baselines in a variety of applications, including web search, item recommendation, and question answering.
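    The following is a minimal numpy sketch of the alternating minimax training described above, under simplifying assumptions that are not from the paper: a single query, linear scoring functions, a softmax generator over a small toy document set, and a REINFORCE-style update that uses the discriminator's score as the generator's reward. The data and all names are hypothetical placeholders for illustration; the actual system uses neural scorers and many queries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical data): one query, 50 candidate documents.
num_docs, dim = 50, 8
doc_feats = rng.normal(size=(num_docs, dim))                  # document feature vectors
true_relevant = rng.choice(num_docs, size=5, replace=False)   # "labelled" relevant docs

theta_d = np.zeros(dim)   # discriminator: linear relevance scorer
theta_g = np.zeros(dim)   # generator: softmax distribution over documents
lr = 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(200):
    g_probs = softmax(doc_feats @ theta_g)   # generator's current ranking distribution

    # Discriminator step: push labelled docs up and generator-sampled docs down.
    fakes = rng.choice(num_docs, size=5, p=g_probs)
    for d, label in [(d, 1.0) for d in true_relevant] + [(d, 0.0) for d in fakes]:
        p = sigmoid(doc_feats[d] @ theta_d)
        theta_d += lr * (label - p) * doc_feats[d]   # logistic-regression gradient step

    # Generator step: REINFORCE, rewarding documents the discriminator scores as relevant.
    samples = rng.choice(num_docs, size=5, p=g_probs)
    for d in samples:
        reward = sigmoid(doc_feats[d] @ theta_d)
        grad_log_prob = doc_feats[d] - doc_feats.T @ g_probs   # grad of log softmax prob of d
        theta_g += lr * reward * grad_log_prob

# After training, the generator's softmax induces a document ranking for the query.
ranking = np.argsort(doc_feats @ theta_g)[::-1]
```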