
    Dynamic Pricing and Learning: Historical Origins, Current Research, and New Directions

    Hardware-Efficient Scalable Reinforcement Learning Systems

    Reinforcement Learning (RL) is a machine learning discipline in which an agent learns by interacting with its environment. In this paradigm, the agent must perceive its state and take actions accordingly. Upon taking each action, a numerical reward is provided by the environment, and the goal of the agent is to maximize the aggregate reward it receives over time. Over the past two decades, a large variety of algorithms have been proposed for selecting actions so as to explore the environment and gradually construct an effective strategy that maximizes the rewards. These RL techniques have been successfully applied to numerous complex, real-world applications, including board games and motor control tasks. Almost all RL algorithms involve the estimation of a value function, which indicates how good it is for the agent to be in a given state in terms of the total expected reward in the long run; alternatively, the value function may reflect the impact of taking a particular action in a given state. The most fundamental approach to constructing such a value function is to update a table that contains a value for each state (or each state-action pair). However, this approach is impractical for large-scale problems, in which the state and/or action spaces are large. To deal with such problems, it is necessary to exploit the generalization capabilities of non-linear function approximators, such as artificial neural networks. This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces. In particular, the work addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations. To address such intricate problems, novel solutions are proposed for both tabular and function-approximation-based RL frameworks. A resource-efficient recurrent neural network algorithm is presented, which exploits adaptive step-size techniques to improve learning characteristics. Moreover, a consolidated actor-critic network is introduced, which omits the modeling redundancy found in typical actor-critic systems. Pivotal concerns are the scalability and speed of the learning algorithms, for which architectures that map efficiently to hardware are devised, allowing a high degree of parallelism. Simulation results on relevant testbench problems demonstrate the solid performance of the proposed solutions.
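
    For context, the sketch below illustrates the tabular, one-step temporal-difference update that the abstract contrasts with function approximation. The chain environment, parameter values and names are invented for the example and are not taken from the dissertation; for large or partially observed state spaces such a table becomes impractical, which is where the neural-network approximators proposed in the work come in.

    ```python
    # Illustrative only: a tiny tabular Q-learning agent on a toy chain environment.
    import numpy as np

    n_states, n_actions = 8, 2            # small enough that a value table is feasible
    alpha, gamma, epsilon = 0.1, 0.95, 0.1

    Q = np.zeros((n_states, n_actions))   # one value estimate per state-action pair

    def step(state, action):
        """Toy chain: action 1 moves right, action 0 moves left; each step costs -1,
        and reaching the rightmost state ends the episode."""
        next_state = min(n_states - 1, max(0, state + (1 if action == 1 else -1)))
        done = next_state == n_states - 1
        return next_state, -1.0, done

    rng = np.random.default_rng(0)
    for episode in range(200):
        s = 0
        for t in range(200):              # cap episode length
            # epsilon-greedy: explore occasionally, otherwise act greedily on Q
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = step(s, a)
            # one-step temporal-difference update of the tabular value estimate
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break

    # For large or partially observed state spaces this table becomes impractical,
    # which is where neural-network function approximation (as in the dissertation) comes in.
    ```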

    Evaluating reinforcement learning for game theory application: learning to price airline seats under competition

    Applied Game Theory has been criticised for not being able to model real decision-making situations. A game's sensitive nature and the difficulty in determining the utility payoff functions make it hard for a decision maker to rely upon any game-theoretic results. The models therefore tend to be kept simple because of the complexity of solving them (i.e. finding the equilibrium). In recent years, owing to increases in computing power, different computer modelling techniques have been applied in Game Theory, a major example being Artificial Intelligence methods such as Genetic Algorithms, Neural Networks and Reinforcement Learning (RL). These techniques allow the modeller to incorporate Game Theory within their models (or simulations) without necessarily knowing the optimal solution. After a warm-up period of repeated episodes, the model learns to play the game well (though not necessarily optimally); this is a form of simulation-optimization. The objective of the research is to investigate the practical usage of RL within a simple sequential stochastic airline seat pricing game. Different forms of RL are considered and compared to the optimal policy, which is found using standard dynamic programming techniques. The airline game and the RL methods display various interesting phenomena, which are also discussed. For completeness, convergence proofs for the RL algorithms were constructed.
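
    For illustration, here is a minimal sketch of the kind of dynamic-programming benchmark the abstract refers to: backward induction over a finite-horizon seat-pricing problem for a single airline. The price menu, purchase probabilities, horizon and capacity are assumptions made for the example; they do not reflect the paper's competitive two-airline game or its actual parameters.

    ```python
    # Illustrative only: optimal single-leg seat pricing by backward induction.
    import numpy as np

    T, C = 20, 10                              # decision periods and seat capacity
    prices = np.array([60.0, 100.0, 140.0])    # assumed price menu
    buy_prob = np.array([0.6, 0.4, 0.2])       # assumed per-period purchase probability

    V = np.zeros((T + 1, C + 1))               # V[t, c]: expected revenue-to-go with c seats left
    policy = np.zeros((T, C + 1), dtype=int)

    # Backward induction: at each (t, c), pick the price that maximizes
    # immediate expected revenue plus the expected revenue-to-go.
    for t in range(T - 1, -1, -1):
        for c in range(1, C + 1):              # with zero seats left, revenue-to-go stays 0
            values = buy_prob * (prices + V[t + 1, c - 1]) + (1 - buy_prob) * V[t + 1, c]
            policy[t, c] = int(np.argmax(values))
            V[t, c] = values[policy[t, c]]

    print("optimal expected revenue:", round(float(V[0, C]), 2))
    print("opening price with full capacity:", prices[policy[0, C]])
    ```

    A competitive version would couple two such value functions through each airline's share of demand, which is where RL becomes attractive once the equilibrium is hard to compute directly.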

    Approximate dynamic programming application to inventory management

    This study developed a new method and investigated the performance of current Approximate Dynamic Programming (ADP) approaches in the context of common inventory circumstances that have not been adequately studied in the literature. The new method uses a technique similar to eligibility traces [113] to improve the performance of the residual gradient method [7]. The ADP approach uses approximation techniques, including learning and simulation schemes, to provide the flexible and adaptive control needed for practical inventory management. However, although ADP has recently received extensive attention in inventory management research, many issues remain uninvestigated, including (1) the application of ADP with a scalable universal function approximator that supports linear operation, i.e., the Radial Basis Function (RBF); (2) the performance of bootstrapping and convergence-guaranteed learning schemes, i.e., Eligibility Traces and Residual Gradient, respectively; (3) the effect of latent state variables, introduced by the recently identified GARCH(1,1) model, on the model-free property of learning-based ADPs; and (4) a performance comparison between the two main ADP families, learning-based and simulation-based ADPs. The purpose of this study is to determine appropriate ADP components and corresponding settings for practical inventory problems by examining these issues. A series of simulation-based experiments is employed to study each of the ADP issues. Due to its simplicity of implementation and popularity as a benchmark in ADP research, the Look-Ahead method is used as the benchmark in this study. Conclusions are drawn mainly from significance tests using aggregate cost as the performance measure. Each ADP method performed comparably to Look-Ahead on inventory problems with low-variance demand and significantly better than Look-Ahead, at the 0.05 significance level, on an inventory problem with high-variance demand. The analysis of the experimental results shows that (1) RBF, with evenly distributed centers and half-midpoint effect scales, is an effective approximate cost-to-go method; (2) Sarsa, a widely used algorithm based on one-step temporal difference learning (TD(0)), is the most efficient learning scheme compared to its eligibility trace enhancement, Sarsa(λ), or to the Residual Gradient method; (3) the new method, Direct Credit Back, works significantly better than the benchmark Look-Ahead, but does not show significant improvement over Residual Gradient in either the zero- or the one-period lead-time problem; (4) the model-free property of learning-based ADPs is affirmed in the presence of GARCH(1,1) latent state variables; and (5) the performance of the simulation-based ADPs, i.e., Rollout and Hindsight Optimization, is superior to that of the learning-based ADPs. In addition, links were found between ADP settings, i.e., Sarsa(λ)'s eligibility trace factor and Rollout's number of simulations and horizon, and conservative behavior, i.e., maintaining a higher inventory level. Our conclusions agree with theoretical results and earlier speculation on ADP applicability, the effectiveness of RBF and TD(0), the model-free property of learning-based ADPs, and the advantage of simulation-based ADPs. On the other hand, our findings contradict the significance of GARCH(1,1) awareness identified by Zhang [130], at least when a learning-based ADP is used. The work presented here has profound implications for future studies of adaptive control for practical inventory management and may one day help solve problems associated with stochastic supply chain management.
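
    As a rough illustration of the learning-based ADP ingredients named above, the sketch below combines an RBF approximation of the action-value (cost-to-go) function with Sarsa(λ) and eligibility traces on an assumed single-item inventory problem. The demand process, cost coefficients, RBF layout and all parameter values are assumptions for the example, not the study's settings.

    ```python
    # Illustrative only: linear Sarsa(lambda) with RBF features for inventory control.
    import numpy as np

    max_inv, n_actions = 20, 5                 # inventory levels 0..20; order 0..4 units
    centers = np.linspace(0, max_inv, 8)       # evenly distributed RBF centers
    width = (centers[1] - centers[0]) / 2.0    # a "half-midpoint"-style scale
    alpha, gamma, lam, epsilon = 0.05, 0.95, 0.7, 0.1

    def features(inv):
        """RBF features of the current inventory level."""
        return np.exp(-((inv - centers) ** 2) / (2.0 * width ** 2))

    w = np.zeros((n_actions, centers.size))    # one linear weight vector per order quantity

    def q(inv, a):
        return float(w[a] @ features(inv))

    rng = np.random.default_rng(1)
    for episode in range(300):
        inv = max_inv // 2
        z = np.zeros_like(w)                   # eligibility traces
        a = int(rng.integers(n_actions))
        for t in range(50):
            demand = int(rng.poisson(2))       # assumed demand process
            on_hand = min(max_inv, inv + a)
            sales = min(on_hand, demand)
            next_inv = on_hand - sales
            # reward = negative cost: holding cost plus a penalty for lost sales
            reward = -(0.5 * next_inv + 4.0 * max(0, demand - on_hand))
            # epsilon-greedy choice of the next order quantity
            a_next = int(rng.integers(n_actions)) if rng.random() < epsilon \
                else int(np.argmax([q(next_inv, b) for b in range(n_actions)]))
            # Sarsa(lambda): temporal-difference error and trace-weighted update
            delta = reward + gamma * q(next_inv, a_next) - q(inv, a)
            z *= gamma * lam
            z[a] += features(inv)
            w += alpha * delta * z
            inv, a = next_inv, a_next
    ```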