
    Actor-Critic Deep Reinforcement Learning for Dynamic Multichannel Access

    Full text link
    We consider the dynamic multichannel access problem, which can be formulated as a partially observable Markov decision process (POMDP). We first propose a model-free actor-critic deep reinforcement learning based framework to explore the sensing policy. To evaluate the performance of the proposed sensing policy and the framework's tolerance against uncertainty, we test the framework in scenarios with different channel switching patterns and different switching probabilities. Then, we consider a time-varying environment to verify the adaptive ability of the proposed framework. Additionally, we provide comparisons with the Deep Q-Network (DQN) based framework proposed in [1], in terms of both average reward and time efficiency.
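
    To make the actor-critic idea concrete, the following is a minimal tabular sketch for a toy multichannel problem, not the paper's deep framework: one of N channels is "good" and advances cyclically with probability p_switch per slot, and the agent's state is the last channel it saw succeed (a crude stand-in for the POMDP belief). All parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_switch, gamma = 5, 0.9, 0.9   # channels, switching prob, discount (assumed)
theta = np.zeros((N, N))           # actor: softmax preferences, one row per state
v = np.zeros(N)                    # critic: state-value estimates
a_lr, c_lr = 0.2, 0.2              # actor / critic learning rates
good, s = 0, 0                     # hidden good channel; agent's state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(20000):
    pi = softmax(theta[s])
    a = rng.choice(N, p=pi)               # sample a channel from the policy
    r = 1.0 if a == good else 0.0         # reward 1 iff we hit the good channel
    s_next = a if r > 0 else s            # remember the last successful channel
    td = r + gamma * v[s_next] - v[s]     # one-step TD error
    v[s] += c_lr * td                     # critic update
    grad = -pi                            # grad of log softmax ...
    grad[a] += 1.0                        # ... evaluated at the taken action
    theta[s] += a_lr * td * grad          # actor update along the TD error
    if rng.random() < p_switch:           # hidden channel-switching dynamics
        good = (good + 1) % N
    s = s_next
```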

    Joint Scheduling and Power-Control for Delay Guarantees in Heterogeneous Cognitive Radios

    Full text link
    An uplink multi-secondary-user (SU) cognitive radio system having average delay constraints as well as an interference constraint to the primary user (PU) is considered. If the interference channels between the SUs and the PU are statistically heterogeneous due to the different physical locations of the SUs, the SUs will experience different delay performances, because SUs located closer to the PU transmit with lower power levels. Two dynamic scheduling-and-power-allocation policies that can provide the required average delay guarantees to all SUs irrespective of their locations are proposed. The first policy solves the problem when the interference constraint is an instantaneous one, while the second is for problems with long-term average interference constraints. We show that although the average interference problem is an extension of the instantaneous interference one, the solution is totally different. The two policies, derived using the Lyapunov optimization technique, are shown to be asymptotically delay optimal while satisfying the delay and interference constraints. Our findings are supported by extensive system simulations, and the policies are shown to outperform existing ones as well as to be robust to channel estimation errors. Comment: Transactions on Wireless Communications, 2016. Keywords: Cognitive Radios, Delay Constraints, Resource Allocation, Stochastic Optimization, Online Algorithm, Lyapunov Optimization, Average Interference Constraint, Priority Queues. arXiv admin note: substantial text overlap with arXiv:1601.00608, arXiv:1512.0298
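
    As a rough illustration of the Lyapunov approach behind such policies, here is a generic drift-plus-penalty scheduler sketch, not the paper's exact algorithm: each slot, one SU is picked to maximize a queue-backlog-weighted rate minus a penalty for interference at the PU, with power capped by an instantaneous interference limit. All gains, caps, and arrival rates below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, I_max, V, P_max = 4, 1.0, 10.0, 2.0   # illustrative parameters
g = rng.uniform(0.1, 1.0, n_users)             # SU -> PU interference gains
Q = np.zeros(n_users)                          # SU queue backlogs (bits)

def schedule(h):
    """Pick (user, rate) maximizing Q_i * rate - V * interference this slot."""
    best, best_rate, best_val = 0, 0.0, -np.inf
    for i in range(n_users):
        p = min(I_max / g[i], P_max)          # respect instantaneous interference cap
        rate = np.log2(1.0 + h[i] * p)        # achievable rate at that power
        val = Q[i] * rate - V * g[i] * p      # drift-plus-penalty weight
        if val > best_val:
            best, best_rate, best_val = i, rate, val
    return best, best_rate

for t in range(1000):
    h = rng.exponential(1.0, n_users)         # SU own-link fading gains
    Q += rng.poisson(0.3, n_users)            # new arrivals (bits per slot)
    i, rate = schedule(h)
    Q[i] = max(Q[i] - rate, 0.0)              # serve the scheduled user
```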

    Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks

    Full text link
    This work studies a generalized class of restless multi-armed bandits with hidden states that allow cumulative feedback, as opposed to the conventional instantaneous feedback. We call them lazy restless bandits (LRB), as the events of decision-making are sparser than the events of state transition; hence, the feedback after each decision event is the cumulative effect of the following state transition events. The states of the arms are hidden from the decision-maker, and the rewards for actions are state dependent. The decision-maker needs to choose one arm in each decision interval such that the long-term cumulative reward is maximized. As the states are hidden, the decision-maker maintains and updates its belief about them. It is shown that LRBs admit an optimal policy which has a threshold structure in belief space. The Whittle-index policy for solving the LRB problem is analyzed, and the indexability of LRBs is shown. Further, closed-form index expressions are provided for two sets of special cases; for more general cases, an algorithm for index computation is provided. An extensive simulation study is presented in which the Whittle-index, modified Whittle-index, and myopic policies are compared. A Lagrangian relaxation of the problem provides an upper bound on the optimal value function, which is used to assess the degree of sub-optimality of various policies.
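
    A small sketch of the belief bookkeeping behind such a threshold policy, for one hidden two-state arm: between sparse decision epochs the state may transition several times, so the belief is propagated through the k-step transition matrix. The transition matrix, interval length, and threshold are illustrative, not taken from the paper.

```python
import numpy as np

P = np.array([[0.9, 0.1],     # P[s, s']: row 0 = bad state, row 1 = good state
              [0.2, 0.8]])

def propagate(b_good, k):
    """Propagate P(state = good) through k unobserved state transitions."""
    b = np.array([1.0 - b_good, b_good])
    b = b @ np.linalg.matrix_power(P, k)
    return b[1]

def play_arm(b_good, threshold=0.5):
    """Threshold rule in belief space: play the arm iff the belief is high."""
    return b_good >= threshold

b = 0.5
for epoch in range(10):
    b = propagate(b, k=3)     # three state transitions per decision interval
    print(epoch, round(b, 3), play_arm(b))
```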

    Learning in Restless Multi-Armed Bandits via Adaptive Arm Sequencing Rules

    Full text link
    We consider a class of restless multi-armed bandit (RMAB) problems with unknown arm dynamics. At each time, a player chooses one arm out of N arms to play, referred to as the active arm, and receives a random reward from a finite set of reward states. The reward state of the active arm transitions according to unknown Markovian dynamics, while the reward state of each passive arm (not chosen to play at time t) evolves according to an arbitrary unknown random process. The objective is an arm-selection policy that minimizes the regret, defined as the reward loss with respect to a player that always plays the most rewarding arm. This class of RMAB problems has been studied recently in the context of communication networks and financial investment applications. We develop a strategy that selects arms to be played in a consecutive manner, dubbed the Adaptive Sequencing Rules (ASR) algorithm. The sequencing rules for selecting arms under the ASR algorithm are adaptively updated and controlled by the current sample reward means. By judiciously designing the adaptive sequencing rules, we show that the ASR algorithm achieves a logarithmic regret order with time, and a finite-sample bound on the regret is established. Although existing methods have shown a logarithmic regret order with time in this RMAB setting, the theoretical analysis shows a significant improvement in the regret scaling with respect to the system parameters under ASR. Extensive simulation results support the theoretical study and demonstrate the strong performance of the algorithm compared to existing methods. Comment: A short version of this paper was presented at the IEEE International Symposium on Information Theory (ISIT) 201
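
    For flavor only, here is a toy illustration of ranking arms by their current sample reward means and playing the selected arm in consecutive blocks. It is emphatically not the paper's ASR rules, which are designed to guarantee sufficient exploration and the stated regret bound; the exploration schedule and all numbers below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.3, 0.5, 0.7])      # hidden Bernoulli reward means (toy)
n = len(means)
counts, sums = np.zeros(n), np.zeros(n)

for block in range(200):
    if block < n:                       # initialization: one block per arm
        a = block
    else:                               # play the empirical leader, explore rarely
        explore = rng.random() < np.log(block) / block
        a = rng.integers(n) if explore else int(np.argmax(sums / counts))
    for _ in range(10):                 # play the chosen arm consecutively
        r = float(rng.random() < means[a])
        counts[a] += 1
        sums[a] += r

print("sample means:", np.round(sums / counts, 3))
```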

    Exploiting Channel Correlation and PU Traffic Memory for Opportunistic Spectrum Scheduling

    Full text link
    We consider a cognitive radio network with multiple primary users (PUs) and one secondary user (SU), where a spectrum server is utilized for spectrum sensing and for scheduling the SU to transmit opportunistically over one of the PU channels. One practical yet challenging scenario is when both the PU occupancy and the channel fading vary over time and exhibit temporal correlations. Little work has been done on exploiting such temporal memory in the channel fading and the PU occupancy simultaneously for opportunistic spectrum scheduling. A main goal of this work is to understand the intricate tradeoffs resulting from the interactions of the two sets of system states, namely the channel fading and the PU occupancy, by casting the problem as a partially observable Markov decision process. We first show that a simple greedy policy is optimal in some special cases. To build a clear understanding of the tradeoffs, we then introduce a full-observation genie-aided system, in which the spectrum server collects channel fading states from all PU channels. The genie-aided system is used to decompose the tradeoffs in the original system into multiple tiers, which are examined progressively. Numerical examples indicate that the optimal scheduler in the original system, with observation of the scheduled channel only, achieves performance very close to that of the genie-aided system. Further, as expected, the optimal policy in the original system significantly outperforms randomized scheduling, pointing to the merit of exploiting the temporal correlation structure in both the channel fading and the PU occupancy.
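
    A minimal sketch of the greedy (myopic) scheduler over joint beliefs: pick the PU channel maximizing the expected immediate reward, P(idle) times the expected rate under the fading belief. The two-state fading model and all numbers are hypothetical.

```python
import numpy as np

p_idle = np.array([0.7, 0.4, 0.9])     # belief each PU channel is unoccupied
p_good = np.array([0.5, 0.8, 0.3])     # belief each fading state is "good"
rate = {"good": 2.0, "bad": 0.5}       # rates in the two fading states (toy)

# Expected one-slot reward per channel under the current joint belief.
expected = p_idle * (p_good * rate["good"] + (1 - p_good) * rate["bad"])
print("schedule channel", int(np.argmax(expected)), expected.round(3))
```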

    A Deep Actor-Critic Reinforcement Learning Framework for Dynamic Multichannel Access

    Full text link
    To make efficient use of limited spectral resources, in this work we propose a deep actor-critic reinforcement learning based framework for dynamic multichannel access. We consider both a single-user case and a scenario in which multiple users attempt to access channels simultaneously. We employ the proposed framework as a single agent in the single-user case, and extend it to a decentralized multi-agent framework in the multi-user scenario. In both cases, we develop algorithms for actor-critic deep reinforcement learning and evaluate the proposed learning policies via experiments and numerical results. In the single-user model, in order to evaluate the performance of the proposed channel access policy and the framework's tolerance against uncertainty, we explore different channel switching patterns and different switching probabilities. In the case of multiple users, we analyze the probability of each user accessing a channel with favorable conditions and the probability of collision. We also address a time-varying environment to verify the adaptive ability of the proposed framework. Additionally, we provide comparisons (in terms of both average reward and time efficiency) between the proposed actor-critic deep reinforcement learning framework, a Deep Q-Network (DQN) based approach, random access, and the optimal policy when the channel dynamics are known. Comment: 14 figures. arXiv admin note: text overlap with arXiv:1810.0369
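
    As a back-of-envelope check on the random-access baseline mentioned above: if U users each pick one of N channels uniformly and independently, a given user collides with probability 1 - (1 - 1/N)^(U-1). The snippet below verifies this by simulation; U, N, and the trial count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
U, N, trials = 4, 6, 100000
picks = rng.integers(N, size=(trials, U))           # each user's channel choice
# User 0 collides if any other user picked the same channel that slot.
collide = (picks[:, 1:] == picks[:, :1]).any(axis=1)
print("simulated:", collide.mean(), "formula:", 1 - (1 - 1 / N) ** (U - 1))
```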

    On the Whittle Index for Restless Multi-armed Hidden Markov Bandits

    Full text link
    We consider a restless multi-armed bandit in which each arm can be in one of two states. When an arm is sampled, the state of the arm is not available to the sampler; instead, a binary signal whose distribution is known and depends on the state of the arm is available. No signal is available if the arm is not sampled. An arm-dependent reward is accrued from each sampling. In each time step, each arm changes state according to known transition probabilities, which in turn depend on whether the arm is sampled or not. Since the state of an arm is never visible and has to be inferred from the current belief and a possible binary signal, we call this the hidden Markov bandit. Our interest is in a policy to select the arm(s) in each time step that maximizes the infinite-horizon discounted reward. Specifically, we seek to use Whittle's index in selecting the arms. We first analyze the single-armed bandit and show that, in general, it admits an approximate threshold-type optimal policy when there is a positive reward for the `no-sample' action. We also identify several special cases for which the threshold policy is indeed optimal. Next, we show that such a single-armed bandit also satisfies an approximate-indexability property. For the case when the single-armed bandit admits a threshold-type optimal policy, we calculate the Whittle index for each arm. Numerical examples illustrate the analytical results. Comment: Revised version, corrected a few typos
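
    The belief recursion at the heart of this model is standard: a Bayes correction on the binary signal followed by a one-step Markov prediction. A sketch with illustrative transition and emission probabilities (not the paper's values):

```python
import numpy as np

P = np.array([[0.8, 0.2],    # transition matrix when the arm is sampled
              [0.3, 0.7]])   # rows: state 0, state 1
q = np.array([0.1, 0.9])     # P(signal = 1 | state) for states 0 and 1

def update_belief(b1, signal):
    """b1 = P(state = 1). Bayes-correct on the signal, then predict forward."""
    b = np.array([1.0 - b1, b1])
    like = q if signal == 1 else 1.0 - q
    b = b * like
    b = b / b.sum()          # posterior after observing the binary signal
    b = b @ P                # one-step state transition (prediction)
    return b[1]

print(update_belief(0.5, signal=1))   # belief in state 1 rises after a "1"
```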

    Channel Probing in Opportunistic Communication Systems

    Get PDF
    We consider a multi-channel communication system in which a transmitter has access to M channels but does not know the state of any of them. We model each channel state using an ON/OFF Markov process, and allow the transmitter to probe a single channel at predetermined probing intervals to decide over which channel to transmit. For models in which the transmitter must transmit over the probed channel, it has been shown that a myopic policy probing the channel most likely to be ON is optimal. In this paper, we allow the transmitter to select a channel over which to transmit that is potentially different from the probed channel. For a system of two channels, we show that the choice of which channel to probe does not affect the throughput. For a system with many channels, we show that a probing policy that probes the channel that is second-most likely to be ON results in higher throughput. We extend the channel probing problem to dynamically choose when to probe based on the probing history, and we characterize the optimal probing policy for various scenarios.
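
    A quick simulation sketch of the probing trade-off: M ON/OFF Markov channels, one probe per slot, and transmission allowed on a channel other than the probed one. It compares probing the most likely-ON channel against the second-most likely. The transition probabilities, and the assumption that an ACK also reveals the transmitted channel's state, are modeling choices for this toy, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(4)
M, T = 4, 20000
p11, p01 = 0.8, 0.3                      # P(ON -> ON), P(OFF -> ON)

def throughput(rank):                    # rank 0: most likely ON; 1: second
    state = rng.random(M) < 0.5
    belief = np.full(M, 0.5)             # P(channel is ON)
    hits = 0
    for _ in range(T):
        state = rng.random(M) < np.where(state, p11, p01)   # channels evolve
        belief = belief * p11 + (1 - belief) * p01          # predict beliefs
        probe = int(np.argsort(belief)[-(rank + 1)])        # choose probe target
        belief[probe] = float(state[probe])                 # probe reveals state
        tx = int(np.argmax(belief))                         # transmit on best belief
        hits += int(state[tx])
        belief[tx] = float(state[tx])    # assumption: ACK reveals tx channel state
    return hits / T

print("probe best:", throughput(0), "probe second-best:", throughput(1))
```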

    Scheduling in Time-correlated Wireless Networks with Imperfect CSI and Stringent Constraint

    Full text link
    In a wireless network, the efficiency of scheduling algorithms over time-varying channels depends heavily on the accuracy of the Channel State Information (CSI), which is usually quite "costly" in terms of consuming network resources. Scheduling in such systems is also subject to stringent constraints such as power and bandwidth, which limit the maximum number of simultaneous transmissions. Meanwhile, communication channels in wireless systems typically fluctuate in a time-correlated manner. We hence design schedulers that exploit the temporal correlation inherent in channels with memory, together with ARQ-style feedback from the users, for better channel state knowledge, under the assumption of Markovian channels and a stringent constraint on the maximum number of simultaneously active users. We model this problem in the framework of a Partially Observable Markov Decision Process. In recent work, a low-complexity optimal solution was developed for this problem under a long-term time-average resource constraint. However, in real systems with instantaneous resource constraints, how to optimally exploit the temporal correlation and satisfy a realistic stringent constraint on the instantaneous service remains elusive. In this work, we incorporate a stringent constraint on the number of simultaneously scheduled users and propose a low-complexity scheduling algorithm that dynamically implements user scheduling and dummy packet broadcasting. We show that the throughput region of the optimal policy under the long-term average resource constraint can be asymptotically achieved by the proposed algorithm in the stringently constrained scenario, in the many-user limiting regime.
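
    A sketch of the belief-based scheduling loop this setting implies, with a hard limit of K simultaneous users and illustrative Gilbert-Elliott parameters: scheduled users return ACK/NACK, revealing their channel state, while unscheduled beliefs are only propagated forward. The paper's algorithm additionally broadcasts dummy packets to probe idle users; that refinement is omitted here.

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, K, T = 8, 2, 10000
p11, p01 = 0.9, 0.2                       # P(good -> good), P(bad -> good)
state = rng.random(n_users) < 0.5         # hidden channel states
belief = np.full(n_users, 0.5)            # P(channel is good) per user
served = 0

for _ in range(T):
    state = rng.random(n_users) < np.where(state, p11, p01)  # channels evolve
    belief = belief * p11 + (1 - belief) * p01               # predict beliefs
    sched = np.argsort(belief)[-K:]                          # K highest beliefs
    served += int(state[sched].sum())                        # successes this slot
    belief[sched] = state[sched].astype(float)               # ARQ reveals states

print("throughput per slot:", served / T)
```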

    Performance of Joint Spectrum Sensing and MAC Algorithms for Multichannel Opportunistic Spectrum Access Ad Hoc Networks

    Full text link
    We present an analytical framework to assess the link-layer throughput of multichannel Opportunistic Spectrum Access (OSA) ad hoc networks. Specifically, we focus on analyzing various combinations of collaborative spectrum sensing and Medium Access Control (MAC) protocol abstractions. We decompose collaborative spectrum sensing into layers, parametrize each layer, classify existing solutions, and propose a new protocol called Truncated Time Division Multiple Access (TTDMA) that supports efficient distribution of sensing results under the "k out of N" fusion rule. In the case of multichannel MAC protocols, we evaluate two main approaches to control channel design: (i) a dedicated channel and (ii) a hopping channel. We propose to augment these protocols with options for handling secondary user (SU) connections preempted by the primary user (PU): (i) buffering the connection until PU departure and (ii) switching the connection to a vacant PU channel. By comparing and optimizing different design combinations, we show that (i) it is generally better to buffer preempted SU connections than to switch them to vacant PU channels and (ii) TTDMA is a promising design option for the collaborative spectrum sensing process when k does not change over time. Comment: 43 pages, 14 figures. Includes a concluding discussion on the validity of the analytical model in P. Pawelczak, S. Pollin, H-S. W. So, A. Bahai, R. V. Prasad, R. Hekmat, "Performance Analysis of Multichannel Medium Access Control Algorithms for Opportunistic Spectrum Access," IEEE Transactions on Vehicular Technology, vol. 58, no. 6, pp. 3014-3031, Jul. 200
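
    The "k out of N" fusion rule itself is simple to state in code: the PU is declared present when at least k of the N local binary sensing decisions say so (k = 1 gives the OR rule, k = N the AND rule). The per-sensor detection probability below is illustrative.

```python
import numpy as np

def fuse(local_decisions, k):
    """local_decisions: N binary sensing reports; returns the fused verdict."""
    return int(np.sum(local_decisions) >= k)

rng = np.random.default_rng(6)
N, k, pd = 7, 3, 0.8                         # sensors, rule, detection prob (toy)
reports = (rng.random(N) < pd).astype(int)   # local decisions, PU actually present
print("reports:", reports, "fused decision:", fuse(reports, k))
```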