Learning how to act: making good decisions with machine learning
This thesis is about machine learning and statistical approaches
to decision making. How can we learn from data to anticipate the
consequences of, and optimally select, interventions or actions?
Problems such as deciding which medication to prescribe to
patients, who should be released on bail, and how much to charge
for insurance are ubiquitous and have far-reaching impacts on
our lives. There are two fundamental approaches to learning how
to act: reinforcement learning, in which an agent directly
intervenes in a system and learns from the outcome, and
observational causal inference, whereby we seek to infer the
outcome of an intervention from observing the system.
The goal of this thesis is to connect and unify these key
approaches. I introduce causal bandit problems: a synthesis that
combines causal graphical models, which were developed for
observational causal inference, with multi-armed bandit problems,
which are a subset of reinforcement learning problems that are
simple enough to admit formal analysis. I show that knowledge of
the causal structure allows us to transfer information learned
about the outcome of one action to predict the outcome of an
alternate action, yielding a novel form of structure between
bandit arms that cannot be exploited by existing algorithms. I
propose an algorithm for causal bandit problems and prove bounds
on the simple regret demonstrating that it is close to minimax
optimal and better than algorithms that do not use the additional
causal information.
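To make the transfer concrete, consider the simplest ("parallel") causal
graph, in which independent binary variables X_1, ..., X_N each influence a
Bernoulli reward Y. The Python sketch below is a minimal illustration under
assumed parameters (the logistic reward and all probabilities are invented
for the example, not taken from the thesis); it shows how a single
observational sample informs every interventional arm at once, exactly the
structure a standard bandit algorithm cannot exploit.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical "parallel" causal graph: N independent binary causes
    # X_1..X_N of a Bernoulli reward Y. The arms are the interventions
    # do(X_i = v) plus an observational arm do() that leaves the system alone.
    N, T = 5, 20000
    p_x = rng.uniform(0.1, 0.9, size=N)   # P(X_i = 1) under observation (assumed)
    w = rng.normal(size=N)                # each X_i's influence on Y (assumed)

    def pull(arm=None):
        """Sample all X's (optionally forcing one) and a Bernoulli reward Y."""
        x = (rng.random(N) < p_x).astype(float)
        if arm is not None:
            i, v = arm
            x[i] = v                      # the intervention do(X_i = v)
        y = float(rng.random() < 1.0 / (1.0 + np.exp(-x @ w)))
        return x, y

    # With no confounding between X_i and Y in this graph,
    # E[Y | do(X_i = v)] = E[Y | X_i = v]: every observational pull carries
    # information about all 2N interventional arms at once -- structure that
    # an algorithm treating arms as unrelated cannot exploit.
    reward_sum = np.zeros((N, 2))
    count = np.zeros((N, 2))
    for _ in range(T):
        x, y = pull()                     # pull only the observational arm
        for i in range(N):
            v = int(x[i])
            reward_sum[i, v] += y
            count[i, v] += 1

    print("estimated E[Y | do(X_i = v)]:")
    print(reward_sum / np.maximum(count, 1))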
New Models and Algorithms for Bandits and Markets
Inspired by advertising markets, we consider large-scale sequential decision-making problems in which a learner must deploy an algorithm to behave optimally under uncertainty. Although many of these problems can be modeled as contextual bandit problems, we argue that the tools and techniques for analyzing bandit problems with large numbers of actions and contexts can be greatly expanded. While convexity and metric-similarity assumptions on the process generating rewards have yielded some algorithms in the existing literature, certain types of assumptions that have been fruitful in offline supervised learning settings have yet to even be considered. Notably missing, for example, is any kind of graphical-model approach to assuming structured rewards, despite the success such assumptions have
achieved in inducing scalable learning and inference with high-dimensional distributions. Similarly, we observe that there are countless tools for understanding the relationship between a choice of model class in supervised learning and the generalization error of the best fit from that class, such as the celebrated VC theory. However, an analogous notion of dimensionality, which relates a generic structural assumption on rewards to regret rates in an online optimization problem, is not fully developed. The primary goal of this dissertation, therefore, will be to fill out the space of models, algorithms, and assumptions used in sequential decision-making problems. Toward this end, we will develop a theory for bandit problems with structured rewards that permit a graphical model representation. We will give an efficient algorithm for regret minimization in such a setting, and along the way will develop a deeper connection between online supervised learning and regret minimization. This dissertation will also introduce a complexity measure for generic structural assumptions on reward functions, which we call the Haystack Dimension. We will prove that the Haystack Dimension characterizes the optimal rates achievable up to log factors. Finally, we will describe more application-oriented techniques for solving problems in advertising markets, which again demonstrate how methods from traditional disciplines, such as statistical survival analysis, can be leveraged to design novel algorithms for optimization in markets.
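As one way to make the graphical-model assumption concrete, the sketch below
(a hypothetical instance, not the dissertation's algorithm and not the
Haystack Dimension itself) treats a bandit whose mean reward decomposes over
small cliques of binary action coordinates; a single ridge regression on
clique features pools every observation, so a pull of one arm informs many
others.

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical factored-reward bandit: an action assigns 4 binary
    # coordinates, and the mean reward decomposes over small cliques, as in
    # a graphical model. There are 2**4 = 16 arms but only 3 * 4 = 12
    # parameters, so a pull of one arm carries information about many others.
    cliques = [(0, 1), (1, 2), (2, 3)]
    theta_true = rng.normal(size=12)      # one weight per (clique, assignment)

    def features(a):
        """One-hot encoding of each clique's local assignment."""
        phi = np.zeros(12)
        for k, (i, j) in enumerate(cliques):
            phi[4 * k + 2 * a[i] + a[j]] = 1.0
        return phi

    actions = list(itertools.product([0, 1], repeat=4))
    Phi = np.array([features(a) for a in actions])

    # Epsilon-greedy over *parameters* rather than arms: one ridge regression
    # pools every observation to estimate all 16 arm means simultaneously.
    A = np.eye(12)                        # ridge regularizer
    b = np.zeros(12)
    for t in range(1000):
        theta_hat = np.linalg.solve(A, b)
        if rng.random() < 0.1:
            idx = int(rng.integers(len(actions)))   # explore
        else:
            idx = int(np.argmax(Phi @ theta_hat))   # exploit
        phi = Phi[idx]
        r = float(phi @ theta_true) + rng.normal(scale=0.5)
        A += np.outer(phi, phi)
        b += r * phi

    print("true best arm:     ", actions[int(np.argmax(Phi @ theta_true))])
    print("estimated best arm:", actions[int(np.argmax(Phi @ np.linalg.solve(A, b)))])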
Dynamic Rate and Channel Selection in Cognitive Radio Systems
In this paper, we investigate dynamic channel and rate selection in cognitive
radio systems which exploit a large number of channels free from primary users.
In such systems, transmitters may rapidly change the selected (channel, rate)
pair to opportunistically learn and track the pair offering the highest
throughput. We formulate the problem of sequential channel and rate selection
as an online optimization problem, and show its equivalence to a {\it
structured} Multi-Armed Bandit problem. The structure stems from inherent
properties of the achieved throughput as a function of the selected channel and
rate. We derive fundamental performance limits satisfied by {\it any} channel
and rate adaptation algorithm, and propose algorithms that achieve (or
approach) these limits. In turn, the proposed algorithms optimally exploit the
inherent structure of the throughput. We illustrate the efficiency of our
algorithms using both test-bed and simulation experiments, in both stationary
and non-stationary radio environments. In stationary environments, the packet
transmission success probabilities at the various (channel, rate) pairs do not
evolve over time, whereas in non-stationary environments, they may. In
practical scenarios, the proposed algorithms are able to track the best
channel and rate quite accurately without the need for any explicit
measurement and feedback of the quality of the various channels.
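The flavour of this structure can be seen in a toy sketch (our illustration,
not the paper's algorithm): if the packet success probability is
non-increasing in the rate, the mean throughput is unimodal in the rate
index, so a sampler that only ever compares the current leader with its two
neighbours pays an exploration cost proportional to the neighbourhood size
rather than to the number of rates.

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy instance of the exploited structure: on a fixed channel the packet
    # success probability is non-increasing in the rate, so the mean
    # throughput rate * p_success(rate) is unimodal in the rate index.
    rates = np.array([6.0, 9.0, 12.0, 18.0, 24.0, 36.0, 48.0, 54.0])  # Mb/s
    p_succ = 1.0 / (1.0 + np.exp((rates - 30.0) / 6.0))  # decreasing (assumed)
    mu = rates * p_succ                                  # unimodal mean throughput

    # Neighbourhood-restricted UCB (a sketch, not the paper's algorithm):
    # keep a current leader rate and only sample it and its two neighbours.
    # Unimodality makes the local optimum global.
    T = 20000
    mean = np.zeros(len(rates))
    cnt = np.zeros(len(rates))
    leader = 0
    for t in range(1, T + 1):
        cand = [k for k in (leader - 1, leader, leader + 1)
                if 0 <= k < len(rates)]
        ucb = [mean[k] + rates[k] * np.sqrt(2 * np.log(t) / cnt[k])
               if cnt[k] else np.inf for k in cand]
        k = cand[int(np.argmax(ucb))]
        r = rates[k] * float(rng.random() < p_succ[k])   # ACK/NACK feedback
        cnt[k] += 1
        mean[k] += (r - mean[k]) / cnt[k]
        leader = max(cand, key=lambda j: mean[j])

    print("true best rate:", rates[int(np.argmax(mu))], "Mb/s")
    print("selected rate: ", rates[leader], "Mb/s")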