388,163 research outputs found

    Factored temporal difference learning in the new ties environment

    Get PDF
    Although reinforcement learning is a popular method for training an agent for decision making based on rewards, well studied tabular methods are not applicable for large, realistic problems. In this paper, we experiment with a factored version of temporal difference learning, which boils down to a linear function approximation scheme utilising natural features coming from the structure of the task. We conducted experiments in the New Ties environment, which is a novel platform for multi-agent simulations. We show that learning utilising a factored representation is effective even in large state spaces, furthermore it outperforms tabular methods even in smaller problems both in learning speed and stability, because of its generalisation capabilities

    An Online Approach to Dynamic Channel Access and Transmission Scheduling

    Full text link
    Making judicious channel access and transmission scheduling decisions is essential for improving performance as well as energy and spectral efficiency in multichannel wireless systems. This problem has been a subject of extensive study in the past decade, and the resulting dynamic and opportunistic channel access schemes can bring potentially significant improvement over traditional schemes. However, a common and severe limitation of these dynamic schemes is that they almost always require some form of a priori knowledge of the channel statistics. A natural remedy is a learning framework, which has also been extensively studied in the same context, but a typical learning algorithm in this literature seeks only the best static policy, with performance measured by weak regret, rather than learning a good dynamic channel access policy. There is thus a clear disconnect between what an optimal channel access policy can achieve with known channel statistics that actively exploits temporal, spatial and spectral diversity, and what a typical existing learning algorithm aims for, which is the static use of a single channel devoid of diversity gain. In this paper we bridge this gap by designing learning algorithms that track known optimal or sub-optimal dynamic channel access and transmission scheduling policies, thereby yielding performance measured by a form of strong regret, the accumulated difference between the reward returned by an optimal solution when a priori information is available and that by our online algorithm. We do so in the context of two specific algorithms that appeared in [1] and [2], respectively, the former for a multiuser single-channel setting and the latter for a single-user multichannel setting. In both cases we show that our algorithms achieve sub-linear regret uniform in time and outperforms the standard weak-regret learning algorithms.Comment: 10 pages, to appear in MobiHoc 201
    corecore