LIPIcs, Volume 251, ITCS 2023, Complete Volume
Bandit Social Learning: Exploration under Myopic Behavior
We study social learning dynamics where the agents collectively follow a
simple multi-armed bandit protocol. Agents arrive sequentially, choose arms and
receive associated rewards. Each agent observes the full history (arms and
rewards) of the previous agents, and there are no private signals. While
collectively the agents face an exploration-exploitation tradeoff, each agent acts
myopically, without regard to exploration. Motivating scenarios concern
reviews and ratings on online platforms.
We allow a wide range of myopic behaviors that are consistent with
(parameterized) confidence intervals, including the "unbiased" behavior as well
as various behavioral biases. While extreme versions of these behaviors
correspond to well-known bandit algorithms, we prove that more moderate
versions lead to stark exploration failures, and consequently to regret rates
that are linear in the number of agents. We provide matching upper bounds on
regret by analyzing "moderately optimistic" agents.
As a special case of independent interest, we obtain a general result on
failure of the greedy algorithm in multi-armed bandits. This is the first such
result in the literature, to the best of our knowledge.
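The failure mode of the greedy algorithm mentioned above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's construction: the two Bernoulli arms, the one-pull-per-arm initialization, and the names `greedy_run` and `failure_rate` are all hypothetical choices made for illustration.

```python
import random


def greedy_run(means, horizon, rng):
    """Simulate one run of the greedy rule on Bernoulli arms (toy setup)."""
    k = len(means)
    pulls = [0] * k
    wins = [0] * k
    choices = []
    for t in range(horizon):
        if t < k:
            # Assumption: each arm is sampled once before acting greedily.
            arm = t
        else:
            # Myopic choice: highest empirical mean, ties go to the lowest index.
            arm = max(range(k), key=lambda a: wins[a] / pulls[a])
        reward = 1 if rng.random() < means[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        choices.append(arm)
    return choices


def failure_rate(means, horizon, runs, seed=0):
    """Fraction of runs in which the best arm is never chosen in the second half."""
    rng = random.Random(seed)
    best = max(range(len(means)), key=lambda a: means[a])
    stuck = 0
    for _ in range(runs):
        choices = greedy_run(means, horizon, rng)
        if best not in choices[horizon // 2:]:
            stuck += 1
    return stuck / runs


if __name__ == "__main__":
    print(failure_rate([0.6, 0.4], horizon=200, runs=2000))
```

In this toy setup, whenever the better arm returns 0 on its first pull and the worse arm returns 1, the better arm's empirical mean stays at zero and it is never chosen again. That event has constant probability, so the expected regret grows linearly with the horizon.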
Reinforcement learning in large state action spaces
Reinforcement learning (RL) is a promising framework for training intelligent agents which learn to optimize long term utility by directly interacting with the environment. Creating RL methods which scale to large state-action spaces is a critical problem towards ensuring real world deployment of RL systems. However, several challenges limit the applicability of RL to large scale settings. These include difficulties with exploration, low sample efficiency, computational intractability, task constraints like decentralization and lack of guarantees about important properties like performance, generalization and robustness in potentially unseen scenarios.
This thesis aims to bridge the aforementioned gap. We propose several principled algorithms and frameworks for studying and addressing the above challenges in RL. The proposed methods cover a wide range of RL settings (single and multi-agent systems (MAS) with all the variations in the latter, prediction and control, model-based and model-free methods, value-based and policy-based methods). In this work we present the first results on several different problems: e.g. tensorization of the Bellman equation which allows exponential sample efficiency gains (Chapter 4), provable suboptimality arising from structural constraints in MAS (Chapter 3), combinatorial generalization results in cooperative MAS (Chapter 5), generalization results on observation shifts (Chapter 7), learning deterministic policies in a probabilistic RL framework (Chapter 6). Our algorithms exhibit provably enhanced performance and sample efficiency along with better scalability. We also shed light on generalization aspects of the agents under different frameworks. These properties have been driven by the use of several advanced tools (e.g. statistical machine learning, state abstraction, variational inference, tensor theory).
In summary, the contributions in this thesis significantly advance progress towards making RL agents ready for large scale, real world applications.
Wardrop Equilibrium Can Be Boundedly Rational: A New Behavioral Theory of Route Choice
As one of the most fundamental concepts in transportation science, Wardrop
equilibrium (WE) has always had a relatively weak behavioral underpinning. To
strengthen this foundation, one must reckon with bounded rationality in human
decision-making processes, such as the lack of accurate information, limited
computing power, and sub-optimal choices. This retreat from behavioral
perfectionism in the literature, however, was typically accompanied by a
conceptual modification of WE. Here we show that giving up perfect rationality
need not force a departure from WE. On the contrary, WE can be reached with
global stability in a routing game played by boundedly rational travelers. We
achieve this result by developing a day-to-day (DTD) dynamical model that
mimics how travelers gradually adjust their route valuations, hence choice
probabilities, based on past experiences. Our model, called cumulative logit
(CULO), resembles the classical DTD models but makes a crucial change: whereas
the classical models assume routes are valued based on the cost averaged over
historical data, ours values the routes based on the cost accumulated. To
describe route choice behaviors, the CULO model only uses two parameters, one
accounting for the rate at which the future route cost is discounted in the
valuation relative to the past ones and the other describing the sensitivity of
route choice probabilities to valuation differences. We prove that the CULO
model always converges to WE, regardless of the initial point, as long as the
behavioral parameters satisfy certain mild conditions. Our theory thus upholds
WE's role as a benchmark in transportation systems analysis. It also resolves
the theoretical challenge posed by Harsanyi's instability problem by explaining
why equally good routes at WE are selected with different probabilities.
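The idea of valuing routes by accumulated rather than averaged cost can be sketched on a toy network. This is an illustrative sketch only, not the authors' CULO specification: the two parallel routes, the linear cost functions in `route_costs`, the constant `weight` applied to each new day's cost, and the `sensitivity` value are all assumptions introduced here.

```python
import math


def route_costs(p1):
    """Hypothetical linear costs on two parallel routes with unit total demand."""
    x1, x2 = p1, 1.0 - p1
    return 1.0 + 2.0 * x1, 2.0 + x2


def culo_sketch(steps, sensitivity=1.0, weight=0.5):
    """Day-to-day dynamic where valuations accumulate (not average) past costs."""
    s1 = s2 = 0.0  # accumulated valuations of routes 1 and 2
    p1 = 0.5
    c1, c2 = route_costs(p1)
    for _ in range(steps):
        # Logit choice probability driven by the valuation difference.
        p1 = 1.0 / (1.0 + math.exp(sensitivity * (s1 - s2)))
        c1, c2 = route_costs(p1)
        # Assumption: each day's cost is added with a constant weight.
        s1 += weight * c1
        s2 += weight * c2
    return p1, c1, c2


if __name__ == "__main__":
    p1, c1, c2 = culo_sketch(500)
    print(p1, c1, c2)
```

For these assumed cost functions, the two routes have equal cost when two thirds of the demand uses route 1, and the sketch should approach that split. The accumulated valuations of the two routes settle at a constant gap, which is how equally costly routes can end up chosen with different probabilities in this toy example.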
Context and uncertainty in decisions from experience
From the moment we wake up each morning, we are faced with countless choices. Should we press snooze on our alarm? Have toast or cereal for breakfast? Bring an umbrella? Agree to work on that new project? Go to the gym or eat a whole pizza while watching Netflix? The challenge when studying decision-making is to collapse these diverse scenarios into feasible experimental methods. The standard theoretical approach is to represent options using outcomes and probabilities and this has provided a rationale for studying decisions using gambling tasks. These tasks typically involve repeated choices between a single pair of options and outcomes that are determined probabilistically. Thus, the two sections in this thesis ask a simple question: are we missing something by using pairs of options that are divorced from the context in which we make choices outside the psychology laboratory?
The first section focuses on the impact of extreme outcomes within a decision context. Chapter 2 addresses whether there is a rational explanation for why these outcomes appear in decisions from experience and numerous other cognitive domains. Chapters 3-5 describe six experiments that distinguish between plausible theories based on whether they measure extremity as categorical, ordinal, or continuous; whether extremity refers to the centre, the edges, or neighbouring outcomes; whether outcomes are represented as types or tokens; and whether extreme outcomes are defined using temporal or distributional characteristics. In the second section, we shift our focus to how people perceive uncertainty. We examine a distinction between uncertainty that is attributed to inadequate knowledge and uncertainty that is attributed to an inherently random process. Chapter 6 describes three experiments that examine whether allowing participants to map their uncertainty onto observable variability leads them to perceive it as potentially resolvable rather than purely stochastic. We then examine how this influences whether they seek additional information. In summary, the experiments described in these two sections demonstrate the importance of context and uncertainty in understanding how we make decisions.
Efficient Prior-Free Mechanisms for No-Regret Agents
We study a repeated Principal-Agent problem between a long-lived Principal
and Agent pair in a prior-free setting. In our setting, the sequence of
realized states of nature may be adversarially chosen, the Agent is non-myopic,
and the Principal aims for a strong form of policy regret. Following Camara,
Hartline, and Johnson, we model the Agent's long-run behavior with behavioral
assumptions that relax the common prior assumption (for example, that the Agent
has no swap regret). Within this framework, we revisit the mechanism proposed
by Camara et al., which informally uses calibrated forecasts of the unknown
states of nature in place of a common prior. We give two main improvements.
First, we give a mechanism that has an exponentially improved dependence (in
terms of both running time and regret bounds) on the number of distinct states
of nature. To do this, we show that our mechanism does not require truly
calibrated forecasts, but rather forecasts that are unbiased subject to only a
polynomially sized collection of events -- which can be produced with
polynomial overhead. Second, in several important special cases -- including
the focal linear contracting setting -- we show how to remove strong
"Alignment" assumptions (which informally require that near-ties are always
broken in favor of the Principal) by specifically deploying "stable" policies
that do not have any near ties that are payoff relevant to the Principal. Taken
together, our new mechanism makes the compelling framework proposed by Camara
et al. much more powerful: it can now be realized over polynomially sized
state spaces while requiring only mild assumptions on Agent behavior.
Hybrid Algorithm Selection and Hyperparameter Tuning on Distributed Machine Learning Resources: A Hierarchical Agent-based Approach
Algorithm selection and hyperparameter tuning are critical steps in both
academic and applied machine learning. At the same time, these steps are
becoming increasingly delicate due to the rapid rise in the number,
diversity, and distributedness of machine learning resources. Multi-agent
systems, when applied to the design of machine learning platforms, bring about
several distinctive characteristics such as scalability, flexibility, and
robustness, just to name a few. This paper proposes a fully automatic and
collaborative agent-based mechanism for selecting distributedly organized
machine learning algorithms and simultaneously tuning their hyperparameters.
Our method builds upon an existing agent-based hierarchical machine-learning
platform and augments its query structure to support the aforementioned
functionalities without being limited to specific learning, selection, and
tuning mechanisms. We have conducted theoretical assessments, formal
verification, and analytical study to demonstrate the correctness, resource
utilization, and computational efficiency of our technique. According to the
results, our solution is totally correct and exhibits linear time and space
complexity in relation to the size of available resources. To provide concrete
examples of how the proposed methodologies can effectively adapt and perform
across a range of algorithmic options and datasets, we have also conducted a
series of experiments using a system comprising 24 algorithms and 9 datasets
- …
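The general idea of combining algorithm selection with hyperparameter tuning across a hierarchy of agents can be sketched as follows. This is a toy illustration under loud assumptions, not the platform or query structure described in the paper: the classes `LeafAgent` and `GroupAgent`, the `score` callables, and the grid search at each leaf are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Result = Tuple[float, str, float]  # (validation score, algorithm name, hyperparameter)


@dataclass
class LeafAgent:
    """Hypothetical agent that owns one algorithm and tunes one hyperparameter."""
    name: str
    score: Callable[[float], float]  # maps a hyperparameter to a validation score
    grid: List[float]

    def query(self) -> Result:
        # Local tuning: pick the grid value with the highest score.
        best = max(self.grid, key=self.score)
        return (self.score(best), self.name, best)


@dataclass
class GroupAgent:
    """Hypothetical parent agent that forwards a query and keeps the best reply."""
    children: List[Union["GroupAgent", LeafAgent]]

    def query(self) -> Result:
        # Selection: each child is visited once, so work is linear in the tree size.
        return max(child.query() for child in self.children)


if __name__ == "__main__":
    a = LeafAgent("a", lambda h: -(h - 2) ** 2, [1.0, 2.0, 3.0])
    b = LeafAgent("b", lambda h: 1 - (h - 0.5) ** 2, [0.0, 0.5, 1.0])
    root = GroupAgent([GroupAgent([a]), b])
    print(root.query())
```

Each leaf tunes locally and each parent only compares its children's replies, so a single query touches every agent once. A real system would replace the toy `score` functions with actual training and validation runs.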