A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits
We study the K-armed dueling bandit problem which is a variation of the
classical Multi-Armed Bandit (MAB) problem in which the learner receives only
relative feedback about the selected pairs of arms. We propose a new algorithm
called Relative Exponential-weight algorithm for Exploration and Exploitation
(REX3) to handle the adversarial utility-based formulation of this problem.
This algorithm is a non-trivial extension of the Exponential-weight algorithm
for Exploration and Exploitation (EXP3). We prove a finite-time
expected regret upper bound of order O(sqrt(K ln(K) T)) for this algorithm and a
general lower bound of order Omega(sqrt(KT)). Finally, we provide
experimental results using real data from information retrieval applications.
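The exponential-weighting idea with purely relative feedback can be sketched as an EXP3-style learner that draws a pair of arms and updates the winner's and loser's weights symmetrically. This is a minimal illustrative sketch, not the paper's exact REX3 update rule; the `duel` callback and the specific importance-weighted update are assumptions for the example.

```python
import math
import random

def dueling_exp3(duel, K, T, gamma=0.1, seed=0):
    """EXP3-style exponential weighting with relative (dueling) feedback.

    duel(a, b) returns 1.0 if arm a beats arm b, 0.0 if it loses,
    0.5 for a tie. Only this relative outcome is observed.
    Illustrative sketch -- not the exact REX3 update from the paper.
    """
    rng = random.Random(seed)
    w = [1.0] * K
    for _ in range(T):
        s = sum(w)
        # mix exponential weights with uniform exploration
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        a = rng.choices(range(K), weights=p)[0]
        b = rng.choices(range(K), weights=p)[0]
        r = duel(a, b)                         # relative feedback only
        # importance-weighted push: winner up, loser down
        w[a] *= math.exp(gamma / K * (r - 0.5) / p[a])
        w[b] *= math.exp(gamma / K * (0.5 - r) / p[b])
        m = max(w)
        w = [wi / m for wi in w]               # rescale to avoid overflow
    return max(range(K), key=lambda i: w[i])
```

With a dominant arm, the learner's weights concentrate on it even though no absolute rewards are ever observed.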
Adversarial bandit approach for RIS-aided OFDM communication
To assist sixth-generation wireless systems in the management of a wide variety of services, ranging from mission-critical services to safety-critical tasks, key physical layer technologies such as reconfigurable intelligent surfaces (RISs) are proposed. Even though RISs are already used in various scenarios to enable the implementation of smart radio environments, they still face challenges with regard to real-time operation. Specifically, high-dimensional fully passive RISs typically need costly system overhead for channel estimation. This paper, however, investigates a semi-passive RIS that requires a very low number of active elements, wherein only two pilots are required per channel coherence time. While still in its infancy, the application of deep learning (DL) tools shows promise in enabling feasible solutions. We propose two low-training-overhead and energy-efficient adversarial bandit-based schemes with outstanding performance gains when compared to DL-based reflection beamforming reference methods. The resulting deep learning models are discussed using state-of-the-art model quality prediction trends.
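The bandit-based reflection control described above can be illustrated with a plain EXP3 learner choosing from a discrete beam codebook using only a scalar reward per coherence block. The codebook size, reward normalization, and the `snr_feedback` callback are assumptions for this sketch, not the paper's proposed schemes.

```python
import math
import random

def exp3_beam_selection(snr_feedback, num_beams, T, gamma=0.07, seed=0):
    """EXP3 over a discrete RIS reflection-beam codebook (illustrative).

    snr_feedback(t, beam) -> reward in [0, 1]; in practice this would be
    a measured, normalized SNR per coherence block, possibly nonstationary.
    Returns the sequence of chosen beam indices.
    """
    rng = random.Random(seed)
    w = [1.0] * num_beams
    picks = []
    for t in range(T):
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / num_beams for wi in w]
        b = rng.choices(range(num_beams), weights=p)[0]
        x = snr_feedback(t, b)                 # bandit feedback: chosen beam only
        w[b] *= math.exp(gamma * x / (num_beams * p[b]))
        m = max(w)
        w = [wi / m for wi in w]               # rescale to avoid overflow
        picks.append(b)
    return picks
```

Because EXP3 makes no stochastic assumptions on the rewards, this kind of learner can in principle track an adversarially varying channel, which is the appeal of the adversarial-bandit formulation here.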
Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration
This paper studies regret minimization with randomized value functions in
reinforcement learning. In tabular finite-horizon Markov Decision Processes, we
introduce a clipping variant of one classical Thompson Sampling (TS)-like
algorithm, randomized least-squares value iteration (RLSVI). Our
high-probability worst-case regret bound
improves the previous sharpest worst-case regret bounds for RLSVI and matches
the existing state-of-the-art worst-case TS-based regret bounds.
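The clipping idea can be illustrated on a toy tabular backward pass: Gaussian noise supplies the TS-like randomization, and each randomized Q-value is clipped to the range attainable from that step. This is a hedged sketch assuming known transition and reward arrays; RLSVI proper fits Q-values by least squares on observed data, and `P`, `R`, and `sigma` here are illustrative.

```python
import numpy as np

def clipped_randomized_vi(P, R, H, sigma, rng):
    """Toy backward pass of randomized value iteration with clipping.

    P: (H, S, A, S) transition probabilities, R: (H, S, A) mean rewards
    in [0, 1]. Gaussian noise injects the TS-like randomization; each
    Q-value is clipped to [0, H - h], the return attainable from step h.
    Illustrative only -- not the paper's exact least-squares algorithm.
    """
    _, S, A = R.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        noise = sigma * rng.standard_normal((S, A))
        backup = R[h] + np.einsum('sax,x->sa', P[h], V[h + 1]) + noise
        Q[h] = np.clip(backup, 0.0, H - h)   # the clipping step
        V[h] = Q[h].max(axis=1)
    policy = Q.argmax(axis=2)                # greedy policy, shape (H, S)
    return V, Q, policy
```

With `sigma=0` the pass reduces to exact value iteration; the clipping only ever tightens estimates toward the feasible value range, which is what prevents the noise from inflating the optimistic values that drive the regret analysis.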