7 research outputs found

    A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

    Get PDF
    We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose a new algorithm, the Relative Exponential-weight algorithm for Exploration and Exploitation (REX3), to handle the adversarial utility-based formulation of this problem. REX3 is a non-trivial extension of the Exponential-weight algorithm for Exploration and Exploitation (EXP3). We prove a finite-time expected regret upper bound of order O(sqrt(K ln(K) T)) for this algorithm and a general lower bound of order Ω(sqrt(KT)). Finally, we provide experimental results using real data from information retrieval applications.
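The abstract describes REX3 as an exponential-weight scheme driven purely by relative feedback between pairs of arms. A minimal sketch of that idea, assuming a `duel(a, b)` oracle that returns 1 if arm `a` beats arm `b` (the update rule here is an illustrative simplification, not the paper's exact algorithm):

```python
import math
import random

def rex3(duel, K, T, gamma=0.1):
    """Sketch of a REX3-style relative exponential-weight update for
    adversarial utility-based dueling bandits. Only relative feedback
    duel(a, b) in {0, 1} is observed, never an absolute reward."""
    w = [1.0] * K
    for _ in range(T):
        total = sum(w)
        # Mix exponential weights with uniform exploration (as in EXP3).
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        a = random.choices(range(K), weights=p)[0]
        b = random.choices(range(K), weights=p)[0]
        outcome = duel(a, b)  # 1 if a beat b, else 0
        # Importance-weighted, centered relative-gain estimates: the
        # winner's weight rises and the loser's falls symmetrically
        # (when a == b the two updates cancel exactly).
        w[a] *= math.exp(gamma / K * (outcome - 0.5) / p[a])
        w[b] *= math.exp(gamma / K * (0.5 - outcome) / p[b])
    total = sum(w)
    return [wi / total for wi in w]
```

Run long enough on a fixed preference ordering, the normalized weights concentrate on the best arm while the gamma term keeps every pair eligible for comparison.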

    Adversarial bandit approach for RIS-aided OFDM communication

    Get PDF
    To help sixth-generation wireless systems manage a wide variety of services, ranging from mission-critical services to safety-critical tasks, key physical-layer technologies such as reconfigurable intelligent surfaces (RISs) have been proposed. Even though RISs are already used in various scenarios to enable smart radio environments, they still face challenges in real-time operation. Specifically, high-dimensional fully passive RISs typically incur costly system overhead for channel estimation. This paper instead investigates a semi-passive RIS that requires very few active elements, wherein only two pilots are needed per channel coherence time. While still in its infancy, the application of deep learning (DL) tools shows promise in enabling feasible solutions. We propose two low-training-overhead, energy-efficient adversarial-bandit-based schemes with substantial performance gains over DL-based reflection beamforming reference methods. The resulting deep learning models are discussed using state-of-the-art model quality prediction trends.
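The abstract frames reflection beamforming as an adversarial bandit problem. One plausible shape for such a scheme, purely as an illustration (the function name, codebook framing, and reward signal below are assumptions, not details from the paper), is standard EXP3 over a fixed beam codebook with a pilot-derived reward:

```python
import math
import random

def exp3_beam_selection(measure_reward, num_beams, T, gamma=0.07):
    """Hypothetical sketch: EXP3-style adversarial-bandit selection of an
    RIS reflection beam from a fixed codebook. measure_reward(i) returns
    a value in [0, 1] for beam i, e.g. a normalized SNR estimated from
    the few available pilots."""
    w = [1.0] * num_beams
    for _ in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / num_beams for wi in w]
        i = random.choices(range(num_beams), weights=p)[0]
        r = measure_reward(i)
        # Importance-weighted update: only the played beam's weight moves,
        # so no per-round estimate of the full channel is needed.
        w[i] *= math.exp(gamma * r / (num_beams * p[i]))
    return max(range(num_beams), key=lambda i: w[i])
```

The appeal for a semi-passive RIS is that only the chosen beam's reward is ever measured, which matches a regime of very few pilots per coherence time.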

    Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration

    Full text link
    This paper studies regret minimization with randomized value functions in reinforcement learning. In tabular finite-horizon Markov Decision Processes, we introduce a clipping variant of a classical Thompson Sampling (TS)-like algorithm, randomized least-squares value iteration (RLSVI). Our Õ(H^2 S sqrt(AT)) high-probability worst-case regret bound improves on the previous sharpest worst-case regret bounds for RLSVI and matches the existing state-of-the-art worst-case TS-based regret bounds.
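The abstract combines two ingredients: RLSVI's reward perturbation (Gaussian noise injected into the least-squares targets, which in the tabular case reduce to empirical Bellman backups) and a clipping step on the resulting values. A minimal tabular sketch under those assumptions; the function name, argument layout, and noise scaling are illustrative, not the paper's exact construction:

```python
import numpy as np

def rlsvi_episode(counts, rewards, transitions, H, S, A, sigma=1.0, clip=None):
    """Sketch of one planning pass of tabular RLSVI: perturb the empirical
    mean rewards with Gaussian noise shrinking as 1/sqrt(n), then run
    backward value iteration. `clip`, if given, caps Q-values into
    [0, clip] -- the clipping variant the abstract refers to.

    counts[s, a]         : visit counts
    rewards[s, a]        : summed observed rewards
    transitions[s, a, :] : next-state visit counts
    """
    Q = np.zeros((H + 1, S, A))  # Q[H] stays 0 (terminal)
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                n = max(counts[s, a], 1)
                # Perturbed empirical reward: this randomization drives
                # exploration, playing the role of a TS posterior sample.
                r_hat = rewards[s, a] / n + np.random.randn() * sigma / np.sqrt(n)
                p_hat = transitions[s, a] / n
                Q[h, s, a] = r_hat + p_hat @ Q[h + 1].max(axis=1)
        if clip is not None:
            Q[h] = np.clip(Q[h], 0.0, clip)
    return Q
```

Clipping to the trivial value range (e.g. [0, H] for rewards in [0, 1]) is what tames the pessimistic tail of the noise and enables the sharper worst-case analysis.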