1,697 research outputs found

    Albanians and “mountain bandits”

    An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

    In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
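
    The cooling idea in this abstract can be illustrated with a much simpler stand-in: softmax (Boltzmann) exploration on a Bernoulli bandit, with a temperature that shrinks as episodes pass. This is only a sketch and not the paper's value-of-information criterion; the schedule `1 / log(t + 1)`, the arm means, and the function names are assumptions chosen for illustration.

```python
import math
import random


def softmax_probs(values, temperature):
    """Boltzmann distribution over arms; a lower temperature is greedier."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def run_annealed_bandit(arm_means, episodes, seed=0):
    """Play a Bernoulli bandit with softmax exploration and a cooling temperature.

    Returns (pull counts, estimated means, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    best = max(arm_means)
    regret = 0.0
    for t in range(1, episodes + 1):
        # Hypothetical cooling schedule (an assumption, not taken from the paper).
        temperature = 1.0 / math.log(t + 1.0)
        probs = softmax_probs(estimates, temperature)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret += best - arm_means[arm]
    return counts, estimates, regret
```

    Calling `run_annealed_bandit([0.2, 0.5, 0.8], episodes=5000)` returns the pull counts, the estimated means, and the cumulative pseudo-regret. No regret guarantee is claimed for this particular schedule.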

    Learning from eXtreme Bandit Feedback

    We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
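
    To make the bias-variance trade-off in this abstract concrete, the sketch below contrasts plain importance sampling with a variant restricted to the logging policy's top-p actions, with both policies renormalized on that subset. It is a loose illustration for a single context and not the paper's sIS estimator, whose exact definition may differ; the log format `(action, logging_prob, reward)` and all function names are assumptions.

```python
def top_p_actions(logging_probs, p):
    """Indices of the p most probable actions under the logging policy."""
    order = sorted(range(len(logging_probs)), key=lambda a: -logging_probs[a])
    return set(order[:p])


def is_estimate(logs, target_probs):
    """Plain importance sampling: mean of reward * pi(a) / mu(a)."""
    return sum(r * target_probs[a] / mu_a for a, mu_a, r in logs) / len(logs)


def restricted_is_estimate(logs, target_probs, logging_probs, subset):
    """Importance sampling over logged actions in `subset` only.

    Both policies are renormalized on the subset, so this estimates the
    target policy's value conditional on acting inside the subset. It is
    biased for the full value but avoids the large weights of rare actions.
    """
    mu_mass = sum(logging_probs[a] for a in subset)
    pi_mass = sum(target_probs[a] for a in subset)
    kept = [(a, r) for a, _, r in logs if a in subset]
    if not kept or pi_mass == 0.0:
        return 0.0
    total = 0.0
    for a, r in kept:
        weight = (target_probs[a] / pi_mass) / (logging_probs[a] / mu_mass)
        total += r * weight
    return total / len(kept)
```

    With `subset = top_p_actions(logging_probs, p)`, the restricted variant discards logged actions outside the top-p set, so the choice of `p` controls how much bias is traded for lower variance.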

    Decentralized Adaptive Helper Selection in Multi-channel P2P Streaming Systems

    In Peer-to-Peer (P2P) multichannel live streaming, helper peers with surplus bandwidth act as micro-servers that compensate for server deficiencies in balancing resources across channel overlays. With a helper level deployed between the server and the peers, optimizing the user/helper topology becomes challenging, since well-known reciprocity-based choking algorithms cannot be applied to the one-directional video streaming from helpers to users. Because peers behave selfishly and there is no central authority among them, helper selection requires coordination. In this paper, we design a distributed online helper selection mechanism that adapts to the supply and demand patterns of the various video channels. To handle strategic peers' exploitation of the helpers' shared resources, our solution guarantees convergence to correlated equilibria (CE) among the helper selection strategies. Online convergence to the set of CE is achieved through the regret-tracking algorithm, which tracks the equilibrium in the presence of stochastic dynamics in the helpers' bandwidth. The resulting CE can help us select proper cooperation policies. Simulation results demonstrate that our algorithm achieves good convergence, load distribution on helpers, and sustainable streaming rates for peers.
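
    The regret-based selection described in this abstract can be sketched for a single peer as regret matching with a constant step size, so that old regrets are gradually forgotten and the rule can follow changing helper bandwidth. This is a simplified illustration and not the paper's mechanism: it assumes the peer observes the payoff of every helper at each step, that payoffs lie in [0, 1], and that `payoff_fn`, `step_size`, and `mu` are hypothetical names and settings.

```python
import random


def regret_tracking(payoff_fn, n_actions, steps, step_size=0.05, mu=None, seed=0):
    """Single-agent regret matching with exponentially forgotten regrets.

    payoff_fn(t) must return the payoff of every action at time t
    (a full-information assumption made for this sketch).
    Returns (history of chosen actions, final regret matrix).
    """
    rng = random.Random(seed)
    if mu is None:
        # Large enough that switch probabilities sum to at most 0.5
        # when payoffs lie in [0, 1].
        mu = 2.0 * (n_actions - 1)
    # regret[j][k]: tracked gain from having played k whenever j was played.
    regret = [[0.0] * n_actions for _ in range(n_actions)]
    action = rng.randrange(n_actions)
    history = []
    for t in range(steps):
        payoffs = payoff_fn(t)
        history.append(action)
        for j in range(n_actions):
            for k in range(n_actions):
                if k == j:
                    continue
                gain = payoffs[k] - payoffs[j] if j == action else 0.0
                # A constant step size discounts old regrets.
                regret[j][k] += step_size * (gain - regret[j][k])
        # Switch to k with probability proportional to positive regret.
        probs = [
            max(regret[action][k], 0.0) / mu if k != action else 0.0
            for k in range(n_actions)
        ]
        probs[action] = 1.0 - sum(probs)
        action = rng.choices(range(n_actions), weights=probs)[0]
    return history, regret
```

    For example, `regret_tracking(lambda t: [0.2, 0.9], n_actions=2, steps=2000)` simulates a peer choosing between two helpers with fixed payoffs; a time-varying `payoff_fn` would model fluctuating bandwidth.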

    Effects of trust-based decision making in disrupted supply chains

    The United States has experienced prolonged severe shortages of vital medications over the past two decades. The causes underlying the severity and prolongation of these shortages are complex, in part due to the complexity of the underlying supply chain networks, which involve supplier-buyer interactions across multiple entities with competitive and cooperative goals. This leads to interesting challenges in maintaining consistent interactions and trust among the entities. Furthermore, disruptions in supply chains influence trust by inducing over-reactive behaviors across the network, thereby impacting the ability to consistently meet the resulting fluctuating demand. To explore these issues, we model a pharmaceutical supply chain with boundedly rational artificial decision makers capable of reasoning about the motivations and behaviors of others. We use multiagent simulations where each agent represents a key decision maker in a pharmaceutical supply chain. The agents possess a Theory-of-Mind capability to reason about the beliefs, and past and future behaviors of other agents, which allows them to assess other agents’ trustworthiness. Further, each agent has beliefs about others’ perceptions of its own trustworthiness that, in turn, impact its behavior. Our experiments reveal several counter-intuitive results showing how small, local disruptions can have cascading global consequences that persist over time. For example, a buyer, to protect itself from disruptions, may dynamically shift to ordering from suppliers with a higher perceived trustworthiness, while the supplier may prefer buyers with more stable ordering behavior. This asymmetry can put the trust-sensitive buyer at a disadvantage during shortages. Further, we demonstrate how the timing and scale of disruptions interact with a buyer’s sensitivity to trustworthiness. This interaction can engender different behaviors and impact the overall supply chain performance, either prolonging and exacerbating even small local disruptions, or mitigating a disruption’s effects. Additionally, we discuss the implications of these results for supply chain operations.

    Learning to Price Supply Chain Contracts against a Learning Retailer

    The rise of big data analytics has automated the decision-making of companies and increased supply chain agility. In this paper, we study the supply chain contract design problem faced by a data-driven supplier who needs to respond to the inventory decisions of the downstream retailer. Both the supplier and the retailer are uncertain about the market demand and need to learn about it sequentially. The goal for the supplier is to develop data-driven pricing policies with sublinear regret bounds under a wide range of possible retailer inventory policies for a fixed time horizon. To capture the dynamics induced by the retailer's learning policy, we first make a connection to non-stationary online learning by following the notion of variation budget. The variation budget quantifies the impact of the retailer's learning strategy on the supplier's decision-making. We then propose dynamic pricing policies for the supplier for both discrete and continuous demand. We also note that our proposed pricing policy only requires access to the support of the demand distribution, but critically, does not require the supplier to have any prior knowledge about the retailer's learning policy or the demand realizations. We examine several well-known data-driven policies for the retailer, including sample average approximation, distributionally robust optimization, and parametric approaches, and show that our pricing policies lead to sublinear regret bounds in all these cases. At the managerial level, we answer affirmatively that there is a pricing policy with a sublinear regret bound under a wide range of retailer learning policies, even though the supplier faces a learning retailer and an unknown demand distribution. Our work also provides a novel perspective in data-driven operations management where the principal has to learn to react to the learning policies employed by other agents in the system.

    Data-driven predictive maintenance scheduling policies for railways

    Inspection and maintenance activities are essential to preserving safety and cost-effectiveness in railways. However, the stochastic nature of railway defect occurrence is usually ignored in the literature; instead, defect stochasticity is considered independently of maintenance scheduling. This study presents a new approach to predict rail and geometry defects that relies on easy-to-obtain data and integrates prediction with inspection and maintenance scheduling activities. In the proposed approach, a novel use of risk-averse and hybrid prediction methodology controls the underestimation of defects. Then, a discounted Markov decision process model utilizes these predictions to determine optimal inspection and maintenance scheduling policies. Furthermore, in the presence of capacity constraints, Whittle indices via the multi-armed restless bandit formulation dynamically provide the optimal policies using the updated transition kernels. Results indicate a high accuracy rate in prediction and effective long-term scheduling policies that are adaptable to changing conditions.

    Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

    An online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. Because existing online learning techniques do not yield vanishing regret for this problem, we develop a novel online learning framework over deterministic Markov decision processes with dynamic state transition and reward functions. Following that, we prove, based on a reduction to the well-studied problem of online learning with switching costs, that if the Markov decision process admits a chasing oracle (i.e., an oracle that simulates any given policy from any initial state with bounded loss), then the online learning problem can be solved with vanishing regret. Our results for the DRACC problem and its applications are then obtained by devising (randomized and deterministic) chasing oracles that exploit the particular structure of these problems.

    Models of market integration in Central Asia – comparative performance

    The paper considers the problem of the integration of markets in Central Asia as a main factor of economic modernization. It first identifies the potential channels for reducing transaction-cost barriers between countries (“models of integration”). Second, it looks at the emergence of these channels and identifies two main puzzles: the success of centralization in individual countries vs. failing international cooperation among them, and successful informal cooperation of companies and trade networks vs. deficits of intergovernmental concerted actions. Third, it looks at the impact of the relative success of emerging models of market integration on the balance of power in Central Asia. Keywords: Central Asia, regionalization, regionalism, decentralization