
    Regret-Minimization Algorithms for Multi-Agent Cooperative Learning Systems

    A Multi-Agent Cooperative Learning (MACL) system is an artificial intelligence (AI) system in which multiple learning agents work together to complete a common task. The recent empirical success of MACL systems in various domains (e.g. traffic control, cloud computing, robotics) has sparked active research into the design and analysis of MACL systems for sequential decision-making problems. One important metric of a learning algorithm for decision-making problems is its regret, i.e. the difference between the highest achievable reward and the actual reward that the algorithm gains. The design and development of a MACL system with low-regret learning algorithms can create enormous economic value. In this thesis, I analyze MACL systems for different sequential decision-making problems. Concretely, Chapters 3 and 4 investigate cooperative multi-agent multi-armed bandit problems, with full-information or bandit feedback, in which multiple learning agents can exchange information through a communication network and each agent observes only the rewards of the actions it chooses. Chapter 5 considers the communication-regret trade-off for online convex optimization in the distributed setting. Chapter 6 discusses how to form highly productive teams of agents, based on their unknown but fixed types, using adaptive incremental matchings. For each of these problems, I present regret lower bounds for feasible learning algorithms and provide efficient algorithms that achieve these bounds. The regret bounds in Chapters 3, 4, and 5 quantify how the regret depends on the connectivity of the communication network and on the communication delay, giving useful guidance for the design of communication protocols in MACL systems.
    Comment: Thesis submitted to the London School of Economics and Political Science for the PhD in Statistics.
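
    The single-agent version of the regret notion defined above can be made concrete in a few lines. The following toy implementation of UCB1, a standard stochastic-bandit baseline rather than any of the thesis's multi-agent algorithms, tracks the pseudo-regret against the best arm; the Bernoulli arm means and the horizon below are illustrative, not from the thesis.

        import math
        import random

        def ucb1(arm_means, horizon, seed=0):
            """Minimal UCB1 sketch: pull each arm once, then pick the arm
            maximizing empirical mean + sqrt(2 ln t / n_pulls). Returns the
            pseudo-regret: best expected reward times horizon, minus the
            reward actually collected."""
            rng = random.Random(seed)
            k = len(arm_means)
            counts = [0] * k
            sums = [0.0] * k
            total = 0.0
            for t in range(1, horizon + 1):
                if t <= k:
                    arm = t - 1  # initialization: pull each arm once
                else:
                    arm = max(range(k), key=lambda a: sums[a] / counts[a]
                              + math.sqrt(2 * math.log(t) / counts[a]))
                reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli
                counts[arm] += 1
                sums[arm] += reward
                total += reward
            return horizon * max(arm_means) - total

        print(ucb1([0.3, 0.5, 0.7], horizon=10000))  # sublinear in the horizon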

    Online decision problems with large strategy sets

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2005. Includes bibliographical references (p. 165-171).
    In an online decision problem, an algorithm performs a sequence of trials, each of which involves selecting one element from a fixed set of alternatives (the "strategy set") whose costs vary over time. After T trials, the combined cost of the algorithm's choices is compared with that of the single strategy whose combined cost is minimum. Their difference is called regret, and one seeks algorithms that are efficient in that their regret is sublinear in T and polynomial in the problem size. We study an important class of online decision problems called generalized multi-armed bandit problems. In the past such problems have found applications in areas as diverse as statistics, computer science, economic theory, and medical decision-making. Most existing algorithms were efficient only in the case of a small (i.e. polynomial-sized) strategy set. We extend the theory by supplying non-trivial algorithms and lower bounds for cases in which the strategy set is much larger (exponential or infinite) and the cost function class is structured, e.g. by constraining the cost functions to be linear or convex. As applications, we consider adaptive routing in networks, adaptive pricing in electronic markets, and collaborative decision-making by untrusting peers in a dynamic environment.
    By Robert David Kleinberg. Ph.D.
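
    For the small (polynomial-sized) strategy sets that the thesis contrasts with, a classical regret-minimization baseline under full information is the multiplicative-weights (Hedge) update. The sketch below is a generic illustration of that baseline, not an algorithm from the thesis; the random cost matrix and the learning rate eta are made up for the example.

        import math
        import random

        def hedge(cost_matrix, eta=0.1, seed=0):
            """Multiplicative-weights sketch: keep a weight per strategy,
            sample a strategy proportionally to the weights, then downweight
            each strategy by exp(-eta * cost). cost_matrix[t][i] is the cost
            of strategy i in trial t, assumed to lie in [0, 1]. Returns the
            regret against the best single strategy in hindsight."""
            rng = random.Random(seed)
            n = len(cost_matrix[0])
            weights = [1.0] * n
            total_cost = 0.0
            for costs in cost_matrix:
                z = sum(weights)
                r = rng.random() * z  # sample a strategy ~ weights
                acc, choice = 0.0, n - 1
                for i, w in enumerate(weights):
                    acc += w
                    if r <= acc:
                        choice = i
                        break
                total_cost += costs[choice]
                weights = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
            best = min(sum(col) for col in zip(*cost_matrix))  # best fixed strategy
            return total_cost - best

        # toy run: 3 strategies, 1000 trials of uniformly random costs
        random.seed(1)
        trials = [[random.random() for _ in range(3)] for _ in range(1000)]
        print(hedge(trials))

    The sampling step is what breaks down when the strategy set is exponential or infinite, which is exactly the regime the thesis addresses by exploiting linear or convex structure in the costs.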

    New Models and Algorithms for Bandits and Markets

    Inspired by advertising markets, we consider large-scale sequential decision-making problems in which a learner must deploy an algorithm to behave optimally under uncertainty. Although many of these problems can be modeled as contextual bandit problems, we argue that the tools and techniques for analyzing bandit problems with large numbers of actions and contexts can be greatly expanded. While convexity and metric-similarity assumptions on the process generating rewards have yielded some algorithms in the existing literature, certain types of assumptions that have been fruitful in offline supervised learning settings have yet to even be considered. Notably missing, for example, is any kind of graphical-model approach to assuming structured rewards, despite the success such assumptions have achieved in inducing scalable learning and inference with high-dimensional distributions. Similarly, we observe that there are countless tools for understanding the relationship between a choice of model class in supervised learning and the generalization error of the best fit from that class, such as the celebrated VC theory. However, an analogous notion of dimensionality, which relates a generic structural assumption on rewards to regret rates in an online optimization problem, is not fully developed. The primary goal of this dissertation, therefore, will be to fill out the space of models, algorithms, and assumptions used in sequential decision-making problems. Toward this end, we will develop a theory for bandit problems with structured rewards that permit a graphical-model representation. We will give an efficient algorithm for regret minimization in such a setting, and along the way will develop a deeper connection between online supervised learning and regret minimization. This dissertation will also introduce a complexity measure for generic structural assumptions on reward functions, which we call the Haystack Dimension. We will prove that the Haystack Dimension characterizes the optimal rates achievable up to log factors. Finally, we will describe more application-oriented techniques for solving problems in advertising markets, which again demonstrate how methods from traditional disciplines, such as statistical survival analysis, can be leveraged to design novel algorithms for optimization in markets.
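
    As a toy illustration of the contextual-bandit setting the dissertation starts from (not its graphical-model algorithm or the Haystack Dimension machinery), an epsilon-greedy learner that keeps per-context empirical means might look as follows; the contexts, arms, and reward probabilities are hypothetical.

        import random
        from collections import defaultdict

        def eps_greedy_contextual(reward_prob, contexts, horizon, eps=0.1, seed=0):
            """Toy contextual bandit: for each observed context, keep empirical
            mean rewards per arm; explore uniformly with probability eps,
            otherwise exploit the best arm seen so far for that context.
            reward_prob[(context, arm)] is a hypothetical Bernoulli parameter.
            Returns the total reward collected over the horizon."""
            rng = random.Random(seed)
            arms = sorted({arm for (_, arm) in reward_prob})
            counts = defaultdict(int)
            sums = defaultdict(float)
            total = 0.0
            for _ in range(horizon):
                x = rng.choice(contexts)  # a context arrives
                if rng.random() < eps:
                    a = rng.choice(arms)  # explore
                else:                     # exploit; untried arms get priority
                    a = max(arms, key=lambda arm: sums[x, arm] / counts[x, arm]
                            if counts[x, arm] else float("inf"))
                r = 1.0 if rng.random() < reward_prob[x, a] else 0.0
                counts[x, a] += 1
                sums[x, a] += r
                total += r
            return total

        probs = {("young", 0): 0.2, ("young", 1): 0.6,
                 ("old", 0): 0.7, ("old", 1): 0.3}
        print(eps_greedy_contextual(probs, ["young", "old"], horizon=5000))

    Tabulating every (context, arm) pair is exactly what fails to scale with large numbers of actions and contexts, which is the motivation for the structured-reward assumptions the dissertation develops.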