196 research outputs found

    Skyline Identification in Multi-Armed Bandits

    We introduce a variant of the classical PAC multi-armed bandit problem. There is an ordered set of $n$ arms $A[1],\dots,A[n]$, each with some stochastic reward drawn from some unknown bounded distribution. The goal is to identify the skyline of the set $A$, consisting of all arms $A[i]$ such that $A[i]$ has larger expected reward than all lower-numbered arms $A[1],\dots,A[i-1]$. We define a natural notion of an $\varepsilon$-approximate skyline and prove matching upper and lower bounds for identifying an $\varepsilon$-skyline. Specifically, we show that in order to identify an $\varepsilon$-skyline from among $n$ arms with probability $1-\delta$, $\Theta\bigg(\frac{n}{\varepsilon^2} \cdot \min\bigg\{ \log\bigg(\frac{1}{\varepsilon \delta}\bigg), \log\bigg(\frac{n}{\delta}\bigg) \bigg\} \bigg)$ samples are necessary and sufficient. When $\varepsilon \gg 1/n$, our results improve over the naive algorithm, which draws enough samples to approximate the expected reward of every arm; the algorithm of (Auer et al., AISTATS'16) for Pareto-optimal arm identification is likewise superseded. Our results show that the sample complexity of the skyline problem lies strictly in between that of best arm identification (Even-Dar et al., COLT'02) and that of approximating the expected reward of every arm. Comment: 18 pages, 2 Figures; an ALT'18/ISIT'18 submissio
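    To make the skyline definition and the naive baseline concrete, here is a minimal Python sketch. It is not the paper's algorithm: it computes the exact skyline of a list of means, and the per-arm sample count of a naive approach derived from Hoeffding's inequality with a union bound. The function names, the assumption that rewards lie in $[0,1]$, and the choice of estimating every mean to within $\varepsilon/2$ are all illustrative assumptions.

```python
import math

def skyline(means):
    """Return 0-based indices i whose mean strictly exceeds every earlier mean."""
    result = []
    best = -math.inf
    for i, mu in enumerate(means):
        if mu > best:
            result.append(i)
            best = mu
    return result

def naive_samples_per_arm(n, eps, delta):
    """Samples per arm for the naive approach (assumption: rewards in [0, 1]).

    Hoeffding gives P(|mean_hat - mu| >= eps/2) <= 2*exp(-m*eps^2/2); a union
    bound over n arms requires m >= (2/eps^2) * ln(2n/delta).
    """
    return math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta))

def naive_empirical_skyline(pull, n, eps, delta):
    """Estimate every arm's mean with the same budget, then take the skyline.

    `pull(i)` is a hypothetical callable returning one reward of arm i.
    """
    m = naive_samples_per_arm(n, eps, delta)
    estimates = [sum(pull(i) for _ in range(m)) / m for i in range(n)]
    return skyline(estimates)
```

    The naive total budget is $n$ times the per-arm count, i.e. on the order of $\frac{n}{\varepsilon^2}\log\frac{n}{\delta}$, which is the second term inside the minimum of the bound stated in the abstract.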

    Vector Optimization with Stochastic Bandit Feedback

    We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study an $(\epsilon,\delta)$-PAC Pareto set identification problem where an evaluation of each design yields a noisy observation of the mean reward vector. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We provide gap-dependent and worst-case lower bounds on the sample complexity and show that in the worst case the sample complexity scales with the square of the ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and the sampling budget affect the Pareto set, the returned $(\epsilon,\delta)$-PAC Pareto set, and the success of identification. Comment: 28 pages, 3 tables, 1 figur
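    As an illustration of how an ordering cone changes the Pareto set, the following sketch computes the exact Pareto set of known mean vectors. It assumes the cone is given in the half-space form $C = \{x : Wx \ge 0\}$ and that design $j$ dominates design $i$ when $\mu_j - \mu_i$ is a nonzero element of $C$; this representation, the function names, and the example matrices are assumptions for illustration, and no sampling or PAC guarantee is involved.

```python
def in_cone(W, x):
    """True if x lies in the polyhedral cone {x : W x >= 0} (W is a list of rows)."""
    return all(sum(w_k * x_k for w_k, x_k in zip(row, x)) >= 0 for row in W)

def dominates(W, mu_j, mu_i):
    """True if mu_j - mu_i is a nonzero element of the cone."""
    diff = [a - b for a, b in zip(mu_j, mu_i)]
    return any(d != 0 for d in diff) and in_cone(W, diff)

def pareto_set(means, W):
    """Indices of designs that no other design dominates under the cone order."""
    return [
        i for i, mu_i in enumerate(means)
        if not any(dominates(W, mu_j, mu_i)
                   for j, mu_j in enumerate(means) if j != i)
    ]

# The identity matrix gives the usual componentwise (positive orthant) order.
ORTHANT = [[1, 0], [0, 1]]
# A cone wider than the orthant (illustrative choice): more pairs become comparable.
WIDE = [[2, 1], [1, 2]]

means = [(0, 3), (2, 2)]
print(pareto_set(means, ORTHANT))  # both designs are incomparable
print(pareto_set(means, WIDE))     # (2, 2) now dominates (0, 3)
```

    With the orthant cone neither design dominates the other, while under the wider cone the difference $(2,-1)$ lies in $C$, so the Pareto set shrinks to the second design.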

    Thirty Years of Machine Learning: The Road to Pareto-Optimal Wireless Networks

    Future wireless networks have substantial potential to support a broad range of complex, compelling applications in both military and civilian fields, where users are able to enjoy high-rate, low-latency, low-cost and reliable information services. Achieving this ambitious goal requires new radio techniques for adaptive learning and intelligent decision making because of the complex heterogeneous nature of the network structures and wireless services. Machine learning (ML) algorithms have had great success in supporting big data analytics, efficient parameter estimation and interactive decision making. Hence, in this article, we review the thirty-year history of ML by elaborating on supervised learning, unsupervised learning, reinforcement learning and deep learning. Furthermore, we investigate their employment in the compelling applications of wireless networks, including heterogeneous networks (HetNets), cognitive radios (CR), Internet of things (IoT), machine to machine networks (M2M), and so on. This article aims to assist readers in clarifying the motivation and methodology of the various ML algorithms, so as to invoke them for hitherto unexplored services as well as scenarios of future wireless networks. Comment: 46 pages, 22 fig

    Pareto Front Identification with Regret Minimization

    We consider Pareto front identification for linear bandits (PFILin) where the goal is to identify a set of arms whose reward vectors are not dominated by any of the others when the mean reward vector is a linear function of the context. PFILin includes the best arm identification problem and multi-objective active learning as special cases. The sample complexity of our proposed algorithm is $\tilde{O}(d/\Delta^2)$, where $d$ is the dimension of contexts and $\Delta$ is a measure of problem complexity. Our sample complexity is optimal up to a logarithmic factor. A novel feature of our algorithm is that it uses the contexts of all actions. In addition to efficiently identifying the Pareto front, our algorithm also guarantees an $\tilde{O}(\sqrt{d/t})$ bound for instantaneous Pareto regret when the number of samples is larger than $\Omega(d\log dL)$ for $L$-dimensional vector rewards. By using the contexts of all arms, our proposed algorithm simultaneously provides efficient Pareto front identification and regret minimization. Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while minimizing the regret. Comment: 25 pages including appendi
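    The target set in this setting can be sketched as follows, assuming the parameter matrix is known. Each arm has a $d$-dimensional context $x$, its $L$-dimensional mean reward is taken to be $\Theta^\top x$, and the Pareto front is the set of arms whose mean vectors are not dominated componentwise. The names `theta`, `mean_rewards`, and `pareto_front` are hypothetical; the paper's algorithm, which must estimate the parameters from noisy samples, is not reproduced here.

```python
def mean_rewards(contexts, theta):
    """Mean reward vectors Theta^T x for each context x.

    contexts: list of d-dimensional lists; theta: d x L matrix as a list of rows.
    """
    num_objectives = len(theta[0])
    return [
        [sum(x[k] * theta[k][l] for k in range(len(x)))
         for l in range(num_objectives)]
        for x in contexts
    ]

def dominates(u, v):
    """True if u is at least v in every objective and strictly better in one."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def pareto_front(contexts, theta):
    """Indices of arms whose mean reward vector no other arm dominates."""
    mu = mean_rewards(contexts, theta)
    return [
        i for i, u in enumerate(mu)
        if not any(dominates(v, u) for j, v in enumerate(mu) if j != i)
    ]

# Illustrative data: d = 2 context dimensions, L = 2 objectives.
contexts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]]
theta = [[1.0, 0.0], [0.0, 1.0]]
print(pareto_front(contexts, theta))  # the last arm is dominated by the third
```

    With $L = 1$ the front reduces to the arms attaining the maximum mean reward, which is how best arm identification arises as a special case.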

    Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making

    In multi-objective decision planning and learning, much attention is paid to producing optimal solution sets that contain an optimal policy for every possible user preference profile. We argue that the step that follows, i.e., determining which policy to execute by maximising the user's intrinsic utility function over this (possibly infinite) set, is under-studied. This paper aims to fill this gap. We build on previous work on Gaussian processes and pairwise comparisons for preference modelling, extend it to the multi-objective decision support scenario, and propose new ordered preference elicitation strategies based on ranking and clustering. Our main contribution is an in-depth evaluation of these strategies using computer and human-based experiments. We show that our proposed elicitation strategies outperform the currently used pairwise methods, and find that users prefer ranking most. Our experiments further show that utilising monotonicity information in GPs, by using a linear prior mean at the start and virtual comparisons to the nadir and ideal points, increases performance. We demonstrate our decision support framework in a real-world study on traffic regulation, conducted with the city of Amsterdam. Comment: AAMAS 2018, Source code at https://github.com/lmzintgraf/gp_pref_elici