1,697 research outputs found
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.
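The annealed information parameter can be illustrated with a minimal soft-max (Boltzmann) bandit sketch, in which an inverse-temperature parameter stands in for the value-of-information trade-off and is increased over episodes. The logarithmic cooling schedule and function names below are illustrative assumptions, not the paper's exact criterion:

```python
import math
import random

def annealed_softmax_bandit(pull, n_arms, n_episodes):
    """Soft-max exploration with an annealed trade-off parameter beta.
    Small beta -> near-uniform, exploration-dominant play;
    large beta -> greedy, exploitation-dominant play.
    The logarithmic cooling schedule below is an illustrative choice."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_episodes + 1):
        beta = math.log(t + 1)  # "cooling": exploration weight decays as beta grows
        prefs = [math.exp(beta * m) for m in means]
        z = sum(prefs)
        r, acc, arm = random.random() * z, 0.0, n_arms - 1
        for a, p in enumerate(prefs):
            acc += p
            if r <= acc:
                arm = a
                break
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return means, counts
```

With a sufficiently fast schedule the policy concentrates on the empirically best arm while retaining enough early exploration to identify it.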
Learning from eXtreme Bandit Feedback
We study the problem of batch learning from bandit feedback in the setting of
extremely large action spaces. Learning from extreme bandit feedback is
ubiquitous in recommendation systems, in which billions of decisions are made
over sets consisting of millions of choices in a single day, yielding massive
observational data. In these large-scale real-world applications, supervised
learning frameworks such as eXtreme Multi-label Classification (XMC) are widely
used despite the fact that they incur significant biases due to the mismatch
between bandit feedback and supervised labels. Such biases can be mitigated by
importance sampling techniques, but these techniques suffer from impractical
variance when dealing with a large number of actions. In this paper, we
introduce a selective importance sampling estimator (sIS) that operates in a
significantly more favorable bias-variance regime. The sIS estimator is
obtained by performing importance sampling on the conditional expectation of
the reward with respect to a small subset of actions for each instance (a form
of Rao-Blackwellization). We employ this estimator in a novel algorithmic
procedure -- named Policy Optimization for eXtreme Models (POXM) -- for
learning from bandit feedback on XMC tasks. In POXM, the selected actions for
the sIS estimator are the top-p actions of the logging policy, where p is
adjusted from the data and is significantly smaller than the size of the action
space. We use a supervised-to-bandit conversion on three XMC datasets to
benchmark our POXM method against three competing methods: BanditNet, a
previously applied partial matching pruning strategy, and a supervised learning
baseline. Whereas BanditNet sometimes improves marginally over the logging
policy, our experiments show that POXM systematically and significantly
improves over all baselines.
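The core idea of restricting importance weights to the logging policy's top-p actions can be sketched in a few lines. This is a deliberate simplification: the paper's sIS estimator Rao-Blackwellizes over the conditional reward expectation, whereas the version below merely zeroes out weights outside the top-p set, and all array names are illustrative:

```python
import numpy as np

def selective_is(rewards, log_p, new_p, in_top_p):
    """Illustrative selective importance sampling: apply importance
    weights only to logged actions inside the logging policy's top-p
    action set; other samples contribute zero. A simplification of
    the sIS estimator, not its exact form."""
    w = np.where(in_top_p, new_p / log_p, 0.0)  # selective weights
    return float(np.mean(w * rewards))
```

Compared with vanilla importance sampling over millions of actions, discarding the long tail of the logging policy trades a small, controlled bias for a large variance reduction.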
Decentralized Adaptive Helper Selection in Multi-channel P2P Streaming Systems
In Peer-to-Peer (P2P) multichannel live streaming, helper peers with surplus
bandwidth resources act as micro-servers to compensate the server deficiencies
in balancing the resources between different channel overlays. With the deployment
of a helper level between the server and the peers, optimizing the user/helper topology
becomes a challenging task, since applying well-known reciprocity-based choking
algorithms is impossible due to the one-directional nature of video streaming
from helpers to users. Because of the selfish behavior of peers and the lack of a
central authority among them, helper selection requires coordination. In this
paper, we design a distributed online helper selection mechanism that adapts
to the supply and demand patterns of the various video channels. To counter
strategic peers' exploitation of the helpers' shared resources, we guarantee
convergence to correlated equilibria (CE) among the helper
selection strategies. Online convergence to the set of CE is achieved through
the regret-tracking algorithm which tracks the equilibrium in the presence of
stochastic dynamics of helpers' bandwidth. The resulting CE can help us select
proper cooperation policies. Simulation results demonstrate that our algorithm
achieves good convergence, balanced load distribution across helpers, and sustainable
streaming rates for peers.
Effects of trust-based decision making in disrupted supply chains
The United States has experienced prolonged severe shortages of vital medications over the past two decades. The causes underlying the severity and prolongation of these shortages are complex, in part due to the complexity of the underlying supply chain networks, which involve supplier-buyer interactions across multiple entities with competitive and cooperative goals. This leads to interesting challenges in maintaining consistent interactions and trust among the entities. Furthermore, disruptions in supply chains influence trust by inducing over-reactive behaviors across the network, thereby impacting the ability to consistently meet the resulting fluctuating demand. To explore these issues, we model a pharmaceutical supply chain with boundedly rational artificial decision makers capable of reasoning about the motivations and behaviors of others. We use multiagent simulations where each agent represents a key decision maker in a pharmaceutical supply chain. The agents possess a Theory-of-Mind capability to reason about the beliefs and the past and future behaviors of other agents, which allows them to assess other agents’ trustworthiness. Further, each agent has beliefs about others’ perceptions of its own trustworthiness that, in turn, impact its behavior. Our experiments reveal several counter-intuitive results showing how small, local disruptions can have cascading global consequences that persist over time. For example, a buyer, to protect itself from disruptions, may dynamically shift to ordering from suppliers with a higher perceived trustworthiness, while the supplier may prefer buyers with more stable ordering behavior. This asymmetry can put the trust-sensitive buyer at a disadvantage during shortages. Further, we demonstrate how the timing and scale of disruptions interact with a buyer’s sensitivity to trustworthiness.
This interaction can engender different behaviors and impact the overall supply chain performance, either prolonging and exacerbating even small local disruptions, or mitigating a disruption’s effects. Additionally, we discuss the implications of these results for supply chain operations.
Learning to Price Supply Chain Contracts against a Learning Retailer
The rise of big data analytics has automated the decision-making of companies
and increased supply chain agility. In this paper, we study the supply chain
contract design problem faced by a data-driven supplier who needs to respond to
the inventory decisions of the downstream retailer. Both the supplier and the
retailer are uncertain about the market demand and need to learn about it
sequentially. The goal for the supplier is to develop data-driven pricing
policies with sublinear regret bounds under a wide range of possible retailer
inventory policies for a fixed time horizon.
To capture the dynamics induced by the retailer's learning policy, we first
make a connection to non-stationary online learning by following the notion of
variation budget. The variation budget quantifies the impact of the retailer's
learning strategy on the supplier's decision-making. We then propose dynamic
pricing policies for the supplier for both discrete and continuous demand. We
also note that our proposed pricing policies require only access to the support
of the demand distribution; critically, they do not require the supplier to
have any prior knowledge of the retailer's learning policy or the demand
realizations. We examine several well-known data-driven policies for the
retailer, including sample average approximation, distributionally robust
optimization, and parametric approaches, and show that our pricing policies
lead to sublinear regret bounds in all these cases.
At the managerial level, we answer affirmatively that there is a pricing
policy with a sublinear regret bound under a wide range of retailer learning
policies, even though the supplier faces a learning retailer and an unknown demand
distribution. Our work also provides a novel perspective on data-driven
operations management, where the principal has to learn to react to the learning
policies employed by other agents in the system.
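The variation-budget machinery suggests a restarting structure, which can be sketched as an explore-then-commit pricing loop whose epoch length grows with the horizon and shrinks with the budget. The `demand` interface, the price grid, and the epoch-length exponent below are all illustrative assumptions, not the paper's actual policy:

```python
def restart_pricing(demand, prices, horizon, variation_budget=1.0):
    """Explore-then-commit pricing with restarts (illustrative).
    The epoch length echoes the restarting strategy common in
    variation-budget online learning. `demand(p, t)` returns the
    demand realized at price p in period t (hypothetical interface)."""
    k = len(prices)
    epoch_len = max(k + 1, int((horizon / max(variation_budget, 1e-9)) ** (2 / 3)))
    revenue, t = 0.0, 0
    while t < horizon:
        est = [0.0] * k
        for i in range(k):              # exploration: try each price once
            if t >= horizon:
                break
            d = demand(prices[i], t)
            est[i] = prices[i] * d
            revenue += prices[i] * d
            t += 1
        best = max(range(k), key=lambda i: est[i])
        end = min(t + epoch_len - k, horizon)
        while t < end:                  # commit to the best observed price
            d = demand(prices[best], t)
            revenue += prices[best] * d
            t += 1
    return revenue
```

Restarting forgets stale demand estimates, which is what keeps regret sublinear when the retailer's own learning makes the supplier's environment non-stationary.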
Data-driven predictive maintenance scheduling policies for railways
Inspection and maintenance activities are essential to preserving safety and cost-effectiveness in railways. However, the stochastic nature of railway defect occurrence is usually ignored in the literature, or treated independently of maintenance scheduling. This study presents a new approach for predicting rail and geometry defects that relies on easy-to-obtain data and integrates prediction with inspection and maintenance scheduling. In the proposed approach, a novel risk-averse, hybrid prediction methodology controls the underestimation of defects. A discounted Markov decision process model then utilizes these predictions to determine optimal inspection and maintenance scheduling policies. Furthermore, in the presence of capacity constraints, Whittle indices obtained via a multi-armed restless bandit formulation dynamically provide the optimal policies using the updated transition kernels. Results indicate high prediction accuracy and effective long-term scheduling policies that adapt to changing conditions.
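The discounted MDP component can be illustrated with plain value iteration over a tiny defect-state model. The states, actions, and transition numbers used in the example are hypothetical, not the paper's calibrated model:

```python
def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration for a discounted maintenance MDP.
    P[a][s][s2]: transition probability under action a;
    R[a][s]: immediate reward of action a in defect state s.
    Returns the optimal state values."""
    n_a, n_s = len(R), len(R[0])
    V = [0.0] * n_s
    while True:
        newV = [max(R[a][s] + gamma * sum(P[a][s][s2] * V[s2]
                                          for s2 in range(n_s))
                    for a in range(n_a))
                for s in range(n_s)]
        if max(abs(x - y) for x, y in zip(V, newV)) < tol:
            return newV
        V = newV
```

With predicted transition kernels plugged into `P`, the greedy policy with respect to the returned values gives the inspect-vs-wait schedule; the Whittle-index formulation extends this to many track segments under a shared capacity constraint.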
Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes
An online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. Because existing online learning techniques do not yield vanishing regret for this problem, we develop a novel online learning framework over deterministic Markov decision processes with dynamic state transition and reward functions. Following that, we prove, based on a reduction to the well-studied problem of online learning with switching costs, that if the Markov decision process admits a chasing oracle (i.e., an oracle that simulates any given policy from any initial state with bounded loss), then the online learning problem can be solved with vanishing regret. Our results for the DRACC problem and its applications are then obtained by devising (randomized and deterministic) chasing oracles that exploit the particular structure of these problems.
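The switching-cost side of the reduction can be illustrated with a "lazy" variant of Hedge: full-information multiplicative weights whose played expert is re-drawn only at block boundaries, bounding the number of switches. This sketches the generic switching-cost technique only, not the chasing-oracle construction, and the `losses(i, t)` interface is an illustrative assumption:

```python
import math
import random

def lazy_hedge(losses, n_experts, horizon, block=10, eta=0.1):
    """Hedge (multiplicative weights) with lazy switching: the played
    expert is re-sampled only every `block` rounds, so at most
    horizon/block switches occur. `losses(i, t)` in [0, 1] is the
    full-information loss of expert i at round t (hypothetical API)."""
    w = [1.0] * n_experts
    current, switches, total = 0, 0, 0.0
    for t in range(horizon):
        if t % block == 0:               # switch only at block boundaries
            z = sum(w)
            r, acc = random.random() * z, 0.0
            for i, wi in enumerate(w):
                acc += wi
                if r <= acc:
                    if t > 0 and i != current:
                        switches += 1
                    current = i
                    break
        total += losses(current, t)
        for i in range(n_experts):       # full-information weight update
            w[i] *= math.exp(-eta * losses(i, t))
    return total, switches
```

Keeping switches rare is exactly what a chasing oracle must make affordable in the MDP setting: each switch means simulating the newly chosen policy from the current state with bounded loss.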
Models of market integration in Central Asia – comparative performance
The paper considers the problem of the integration of markets in Central Asia as a main factor of economic modernization. It first identifies the potential channels for reducing transaction-cost barriers between countries (“models of integration”). Second, it looks at the emergence of these channels and identifies two main puzzles: the success of centralization in individual countries vs. failing international cooperation among them, and the successful informal cooperation of companies and trade networks vs. deficits in intergovernmental concerted actions. Third, it looks at the impact of the relative success of emerging models of market integration on the balance of power in Central Asia.