87 research outputs found

    A survey on multi-player bandits

    Works released after June 2022 are not considered in this survey. Due mostly to its application to cognitive radio networks, the multiplayer bandits problem has gained a lot of interest in the last decade. Considerable progress has been made on its theoretical aspects. However, the current algorithms are far from applicable, and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real cognitive radio networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that further study of these different directions might lead to theoretical algorithms adapted to real-world situations.

    Multi-Flow Transmission in Wireless Interference Networks: A Convergent Graph Learning Approach

    We consider the problem of multi-flow transmission in wireless networks, where data signals from different flows can interfere with each other due to mutual interference between links along their routes, resulting in reduced link capacities. The objective is to develop a multi-flow transmission strategy that routes flows across the wireless interference network to maximize the network utility. However, obtaining an optimal solution is computationally expensive due to the large state and action spaces involved. To tackle this challenge, we introduce a novel algorithm called Dual-stage Interference-Aware Multi-flow Optimization of Network Data-signals (DIAMOND). The design of DIAMOND allows for a hybrid centralized-distributed implementation, which is a characteristic of 5G and beyond technologies with centralized unit deployments. A centralized stage computes the multi-flow transmission strategy using a novel design of a graph neural network (GNN) reinforcement learning (RL) routing agent. Then, a distributed stage improves the performance based on a novel design of distributed learning updates. We provide a theoretical analysis of DIAMOND and prove that it converges to the optimal multi-flow transmission strategy as time increases. We also present extensive simulation results over various network topologies (random deployment, NSFNET, GEANT2), demonstrating the superior performance of DIAMOND compared to existing methods.

    Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

    Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this survey, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model. Comment: To appear in Foundations and Trends in Machine Learning.
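For orientation, the quantity that regret analyses of the i.i.d. case usually bound can be written as below. The notation is an assumption made for illustration and is not taken from the abstract: $K$ arms with mean payoffs $\mu_1, \dots, \mu_K$, the arm $I_t$ chosen at round $t$, and a horizon of $n$ rounds.

```latex
\overline{R}_n \;=\; n\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{n} \mu_{I_t}\right],
\qquad
\mu^{*} \;=\; \max_{1 \le i \le K} \mu_i
```

Under this definition, a strategy that always plays the best arm has zero regret, and a strategy that explores forever at a fixed rate has regret growing linearly in $n$.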

    A Gentle Introduction to Reinforcement Learning and its Application in Different Fields

    Due to the recent progress in Deep Neural Networks, Reinforcement Learning (RL) has become one of the most important and useful technologies. It is a learning method where a software agent interacts with an unknown environment, selects actions, and progressively discovers the environment dynamics. RL has been effectively applied in many important areas of real life. This article intends to provide an in-depth introduction to the Markov Decision Process, RL, and its algorithms. Moreover, we present a literature review of the application of RL to a variety of fields, including robotics and autonomous control, communication and networking, natural language processing, games and self-organized systems, scheduling management and configuration of resources, and computer vision.

    Energy efficient resource allocation for future wireless communication systems

    The next generation of wireless communication systems envisions a massive number of connected battery-powered wireless devices. Replacing the battery of such devices is costly or infeasible. To this end, energy harvesting (EH) is a promising technique to prolong the lifetime of such devices. Because of randomness in the amount and availability of the harvested energy, existing communication techniques require revisions to address the issues specific to EH systems. In this thesis, we aim at revisiting fundamental wireless communication problems and addressing the future perspective on service-based applications with the specific characteristics of EH in mind. In the first part of the thesis, we address three fundamental problems that exist in wireless communication systems, namely multiple access strategy, overcoming the wireless channel, and providing reliability. Since the wireless channel is a shared medium, concurrent transmissions of multiple devices cause interference, which results in collision and eventual loss of the transmitted data. Multiple access protocols aim at providing a coordination mechanism between multiple transmissions so as to enable a collision-free medium. We revisit the random access protocol for its distributed and low-energy characteristics while incorporating the statistical correlation of the EH processes across two transmitters. We design a simple threshold-based policy which only allows transmission if the battery state is above a certain threshold. By optimizing the threshold values, we show that by carefully addressing the correlation information, the randomness can be turned into an opportunity, in some cases providing optimal coordination between transmitters without any collisions. Upon accessing the channel, a wireless transmitter is faced with a transmission medium that exhibits random and time-varying properties. 
A transmitter can adapt its transmission strategy to the specific state of the channel for an efficient transmission of information. This requires a process known as channel sensing to acquire the channel state, which is costly in terms of time and energy. The contribution of the channel sensing operation to the energy consumption in EH wireless transmitters is not negligible and requires proper optimization. We developed an intelligent channel sensing strategy for an EH transmitter communicating over a time-correlated wireless channel. Our results demonstrate that, despite the associated time and energy cost, sensing the channel intelligently to track the channel state improves the achievable long-term throughput significantly as compared to the performance of those protocols lacking this ability as well as the one that always senses the channel. Next, we study an EH receiver employing Hybrid Automatic Repeat reQuest (HARQ) to ensure reliable end-to-end communications. In inherently error-prone wireless communications systems, re-transmissions triggered by decoding errors have a major impact on the energy consumption of wireless devices. We take into account the energy consumption induced by HARQ to develop simple-to-implement optimal algorithms that minimize the number of retransmissions required to successfully decode the packet. The large number of connected edge devices envisioned in future wireless technologies enables a wide range of resources with significant sensing capabilities. The ability to collect various data from the sensors has enabled many exciting smart applications. Providing data at a certain quality greatly improves the performance of many such applications. However, providing high quality is demanding for energy-limited sensors. Thus, in the second part of the thesis, we optimize the sensing resolution of an EH wireless sensor in order to efficiently utilize the harvested energy to maximize an application-dependent utility.
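The threshold-based access policy described in this abstract can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not taken from the thesis: time is slotted, each energy arrival is one unit, a transmission drains the whole battery, and the two harvesting modes (`"exclusive"`, where exactly one transmitter harvests per slot, and `"common"`, where both harvest together) are hypothetical stand-ins for negatively and positively correlated EH processes. The function name and parameters are likewise illustrative.

```python
import random


def simulate_threshold_access(thresholds, horizon, harvest="exclusive",
                              capacity=5, seed=0):
    """Toy slotted random access for two energy-harvesting transmitters.

    Each transmitter sends only when its battery level is at least its
    threshold. Returns (delivered, collisions).
    """
    rng = random.Random(seed)
    battery = [0, 0]
    delivered = 0
    collisions = 0
    for _ in range(horizon):
        # Energy arrivals for this slot (assumed models, see lead-in).
        if harvest == "exclusive":
            lucky = rng.randrange(2)  # exactly one transmitter harvests
            arrivals = [lucky == 0, lucky == 1]
        elif harvest == "common":
            both = rng.random() < 0.5  # both harvest, or neither does
            arrivals = [both, both]
        else:
            raise ValueError("harvest must be 'exclusive' or 'common'")
        for i in range(2):
            if arrivals[i]:
                battery[i] = min(capacity, battery[i] + 1)

        # Threshold policy: transmit only if the battery is full enough.
        senders = [i for i in range(2) if battery[i] >= thresholds[i]]
        for i in senders:
            battery[i] = 0  # assumption: a transmission uses all stored energy

        if len(senders) == 1:
            delivered += 1   # a lone transmission succeeds
        elif len(senders) == 2:
            collisions += 1  # simultaneous transmissions are lost
    return delivered, collisions


if __name__ == "__main__":
    print(simulate_threshold_access((1, 1), 1000, harvest="exclusive"))
    print(simulate_threshold_access((1, 1), 1000, harvest="common"))
```

In this toy model, thresholds of `(1, 1)` give collision-free access when the arrivals are mutually exclusive, while the same thresholds collide on every transmission when the arrivals coincide. That contrast is the sense in which correlation information can matter for choosing thresholds.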

    New Directions in Online Learning: Boosting, Partial Information, and Non-Stationarity

    Online learning, where a learning algorithm fits a model on-the-fly with streaming data, has become an important research area in machine learning. Batch learning, where the entire data set has to be available to the learning algorithm, is not always a suitable paradigm for the big data era. It is increasingly common in many practical situations, such as online ads prediction or control of self-driving cars, that data instances naturally arrive in a sequential manner. In these situations, researchers want to update their model in an online fashion. This dissertation pursues several topics at the frontier of online learning research. In Chapter 2 and Chapter 3, the journey starts with online boosting. Online boosting studies how to combine multiple online weak learners to get a stronger learner. Chapter 2 considers online multi-class classification problems. Chapter 3 focuses on the more challenging multi-label ranking problem where there are multiple correct labels and the learner outputs a ranking of labels based on their relevance. In both chapters, an optimal algorithm and an adaptive algorithm are proposed. The optimal algorithms require a minimal number of weak learners to attain the desired accuracy. The adaptive algorithms are practically more useful since they do not require a priori knowledge about the strength of weak learners and are more computationally efficient. The adaptive algorithms are not statistically optimal but they still come with reasonable performance guarantees. The empirical results on real data sets support the theoretical findings and the proposed boosting algorithms outperformed existing competitors on benchmark data sets. Chapter 4 considers the partial information setting, where the learner does not receive the true labels. Partial feedback is common in practice as obtaining complete feedback can be costly. 
The chapter revisits the boosting algorithms that are presented in Chapter 2 and Chapter 3 and extends them to work with partial information feedback. Despite the learner receiving much less information, comparable performance guarantees can be made. Later in Chapter 5 and Chapter 6, we move on to another interesting area in online learning called restless bandit problems. Unlike the classical (stochastic) multi-armed bandit problems where the reward distributions are unknown but stationary, in restless bandit problems the distributions can change over time. This extra layer of complexity allows us to study more complicated models, but the analysis becomes even more difficult. In restless bandit problems, it is assumed that each arm has a state that evolves according to an unknown Markov process, and the reward distribution depends on the arm's current state. This setting can be thought of as a sub-class of reinforcement learning, and the partial observability inherent in this problem makes the analysis very challenging. The well-known Thompson Sampling algorithm is analyzed and a Bayesian regret bound for it is derived. Chapter 5 considers the episodic case where the system periodically resets. Chapter 6 extends the analysis to the more challenging non-episodic (i.e., infinite time horizon) case. In both settings, Thompson Sampling algorithms (with slight modifications) enjoy sub-linear regret bounds, and the empirical results on simulated data support this fact. The experiments also suggest the possibility that the algorithm can be used in the frequentist setting even though the theoretical bounds are only shown for the Bayesian regret. PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155110/1/yhjung_1.pd
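As background for the Thompson Sampling analysis mentioned above, here is a minimal sketch of the algorithm in its classical stationary Bernoulli form. It is not the modified restless-bandit variant studied in the dissertation. The function name, the uniform Beta(1, 1) priors, and the example arm means are assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on a stationary bandit.

    true_means: success probability of each arm (hidden from the learner).
    Returns (pulls per arm, total reward collected).
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    total_reward = 0
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        # Update the posterior counts of the played arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward


if __name__ == "__main__":
    pulls, total = thompson_sampling([0.1, 0.9], horizon=2000)
    print(pulls, total)
```

Because arms are chosen by sampling from the posterior, exploration fades on its own as the posteriors concentrate. In the restless setting the arm states also evolve over time, so a posterior over the unknown dynamics would be needed instead of these simple per-arm counts.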