186 research outputs found

    Laplacian-regularized graph bandits: Algorithms and theoretical analysis

    Get PDF
    We consider a stochastic linear bandit problem with multiple users, where the relationship between users is captured by an underlying graph and user preferences are represented as smooth signals on the graph. We introduce a novel bandit algorithm where the smoothness prior is imposed via the random-walk graph Laplacian, which leads to a single-user cumulative regret scaling as O~(ΨdT)\tilde{\mathcal{O}}(\Psi d \sqrt{T}) with time horizon TT, feature dimensionality dd, and the scalar parameter Ψ∈(0,1)\Psi \in (0,1) that depends on the graph connectivity. This is an improvement over O~(dT)\tilde{\mathcal{O}}(d \sqrt{T}) in \algo{LinUCB}~\Ccite{li2010contextual}, where user relationship is not taken into account. In terms of network regret (sum of cumulative regret over nn users), the proposed algorithm leads to a scaling as O~(ΨdnT)\tilde{\mathcal{O}}(\Psi d\sqrt{nT}), which is a significant improvement over O~(ndT)\tilde{\mathcal{O}}(nd\sqrt{T}) in the state-of-the-art algorithm \algo{Gob.Lin} \Ccite{cesa2013gang}. To improve scalability, we further propose a simplified algorithm with a linear computational complexity with respect to the number of users, while maintaining the same regret. Finally, we present a finite-time analysis on the proposed algorithms, and demonstrate their advantage in comparison with state-of-the-art graph-based bandit algorithms on both synthetic and real-world data

    Laplacian-regularized graph bandits: algorithms and theoretical analysis

    Get PDF
    We consider a stochastic linear bandit problem with multiple users, where the relationship between users is captured by an underlying graph and user preferences are represented as smooth signals on the graph. We introduce a novel bandit algorithm where the smoothness prior is imposed via the random-walk graph Laplacian, which leads to a singleuser cumulative regret scaling as Õ(Ψd√T) with time horizon T, feature dimensionality d, and the scalar parameter Ψ ∈ (0, 1) that depends on the graph connectivity. This is an improvement over Õ(d,√T) in LinUCB [Li et al., 2010], where user relationship is not taken into account. In terms of network regret (sum of cumulative regret over n users), the proposed algorithm leads to a scaling as Õ(Ψd√nT), which is a significant improvement over Õ(nd√T) in the state-of-the-art algorithm Gob.Lin [Cesa-Bianchi et al., 2013]. To improve scalability, we further propose a simplified algorithm with a linear computational complexity with respect to the number of users, while maintaining the same regret. Finally, we present a finite-time analysis on the proposed algorithms, and demonstrate their advantage in comparison with state-of-the-art graph-based bandit algorithms on both synthetic and real-world data

    A Gang of Adversarial Bandits

    Get PDF
    We consider running multiple instances of multi-armed bandit (MAB) problems in parallel. A main motivation for this study are online recommendation systems, in which each of N users is associated with a MAB problem and the goal is to exploit users' similarity in order to learn users' preferences to K items more efficiently. We consider the adversarial MAB setting, whereby an adversary is free to choose which user and which loss to present to the learner during the learning process. Users are in a social network and the learner is aided by a-priori knowledge of the strengths of the social links between all pairs of users. It is assumed that if the social link between two users is strong then they tend to share the same action. The regret is measured relative to an arbitrary function which maps users to actions. The smoothness of the function is captured by a resistance-based dispersion measure Ψ. We present two learning algorithms, GABA-I and GABA-II which exploit the network structure to bias towards functions of low Ψ values. We show that GABA-I has an expected regret bound of O(pln(N K/Ψ)ΨKT) and per-trial time complexity of O(K ln(N)), whilst GABA-II has a weaker O(pln(N/Ψ) ln(N K/Ψ)ΨKT) regret, but a better O(ln(K) ln(N)) per-trial time complexity. We highlight improvements of both algorithms over running independent standard MABs across users

    Meta-learning with Stochastic Linear Bandits

    Get PDF
    We investigate meta-learning procedures in the setting of stochastic linear bandits tasks. The goal is to select a learning algorithm which works well on average over a class of bandits tasks, that are sampled from a task-distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a square euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation

    Provably Efficient Learning in Partially Observable Contextual Bandit

    Full text link
    In this paper, we investigate transfer learning in partially observable contextual bandits, where agents have limited knowledge from other agents and partial information about hidden confounders. We first convert the problem to identifying or partially identifying causal effects between actions and rewards through optimization problems. To solve these optimization problems, we discretize the original functional constraints of unknown distributions into linear constraints, and sample compatible causal models via sequentially solving linear programmings to obtain causal bounds with the consideration of estimation error. Our sampling algorithms provide desirable convergence results for suitable sampling distributions. We then show how causal bounds can be applied to improving classical bandit algorithms and affect the regrets with respect to the size of action sets and function spaces. Notably, in the task with function approximation which allows us to handle general context distributions, our method improves the order dependence on function space size compared with previous literatures. We formally prove that our causally enhanced algorithms outperform classical bandit algorithms and achieve orders of magnitude faster convergence rates. Finally, we perform simulations that demonstrate the efficiency of our strategy compared to the current state-of-the-art methods. This research has the potential to enhance the performance of contextual bandit agents in real-world applications where data is scarce and costly to obtain.Comment: 47 page

    Contextual Bandit Modeling for Dynamic Runtime Control in Computer Systems

    Get PDF
    Modern operating systems and microarchitectures provide a myriad of mechanisms for monitoring and affecting system operation and resource utilization at runtime. Dynamic runtime control of these mechanisms can tailor system operation to the characteristics and behavior of the current workload, resulting in improved performance. However, developing effective models for system control can be challenging. Existing methods often require extensive manual effort, computation time, and domain knowledge to identify relevant low-level performance metrics, relate low-level performance metrics and high-level control decisions to workload performance, and to evaluate the resulting control models. This dissertation develops a general framework, based on the contextual bandit, for describing and learning effective models for runtime system control. Random profiling is used to characterize the relationship between workload behavior, system configuration, and performance. The framework is evaluated in the context of two applications of progressive complexity; first, the selection of paging modes (Shadow Paging, Hardware-Assisted Page) in the Xen virtual machine memory manager; second, the utilization of hardware memory prefetching for multi-core, multi-tenant workloads with cross-core contention for shared memory resources, such as the last-level cache and memory bandwidth. The resulting models for both applications are competitive in comparison to existing runtime control approaches. For paging mode selection, the resulting model provides equivalent performance to the state of the art while substantially reducing the computation requirements of profiling. For hardware memory prefetcher utilization, the resulting models are the first to provide dynamic control for hardware prefetchers using workload statistics. Finally, a correlation-based feature selection method is evaluated for identifying relevant low-level performance metrics related to hardware memory prefetching
    • …
    corecore