    Efficient Strategy Iteration for Mean Payoff in Markov Decision Processes

    Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Mean payoff (or long-run average reward) provides a mathematically elegant formalism for expressing performance-related properties. Strategy iteration is one of the solution techniques applicable in this context. While in many other contexts it is the technique of choice, owing to advantages over, e.g., value iteration, such as precision or the possibility of domain-knowledge-aware initialization, it is rarely used for MDPs, since there it scales worse than value iteration. We provide several techniques that speed up strategy iteration by orders of magnitude for many MDPs, eliminating the performance disadvantage while preserving all of its advantages.
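
    The loop the abstract refers to is classical Howard-style strategy iteration for the mean payoff objective: evaluate the current strategy by solving a linear system for its gain and bias, then switch to greedy actions until no strict improvement remains. Below is a minimal sketch for the unichain case; it does not reproduce the paper's speed-up techniques, and the array shapes are illustrative assumptions.

    import numpy as np

    def strategy_iteration_mean_payoff(P, r, max_iter=100):
        # P[a] is an (S, S) transition matrix for action a, r an (S, A)
        # reward matrix; the MDP is assumed unichain, so a single gain g
        # and a bias vector h satisfy  g + h = r_pi + P_pi h.
        S, A = r.shape
        pi = np.zeros(S, dtype=int)                # arbitrary initial strategy
        for _ in range(max_iter):
            # evaluation: solve for (g, h) with the normalization h[0] = 0
            P_pi = np.array([P[pi[s]][s] for s in range(S)])
            r_pi = r[np.arange(S), pi]
            M = np.eye(S) - P_pi
            M[:, 0] = 1.0                          # column 0 now carries g
            sol = np.linalg.solve(M, r_pi)
            g, h = sol[0], sol.copy()
            h[0] = 0.0
            # improvement: switch only on a strict gain to avoid cycling
            q = r + np.array([P[a] @ h for a in range(A)]).T
            new_pi = pi.copy()
            for s in range(S):
                if q[s].max() > q[s, pi[s]] + 1e-9:
                    new_pi[s] = q[s].argmax()
            if np.array_equal(new_pi, pi):
                return g, pi                       # optimal strategy found
            pi = new_pi
        return g, pi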

    Event-Driven Optimal Feedback Control for Multi-Antenna Beamforming

    Transmit beamforming is a simple multi-antenna technique for increasing the throughput and transmission range of a wireless communication system. The required feedback of channel state information (CSI) can result in excessive overhead, especially under high mobility or with many antennas. This work concerns efficient feedback for transmit beamforming and establishes a new approach to controlling feedback so as to maximize net throughput, defined as throughput minus average feedback cost. The feedback controller uses a stationary policy to turn CSI feedback on or off according to the system state, which comprises the channel state and the transmit beamformer. Assuming channel isotropy and Markovity, the controller's state reduces to two scalars, which allows the optimal control policy to be computed efficiently using dynamic programming. First, consider a perfect, error-free feedback channel in which each feedback instant pays a fixed price. The corresponding optimal feedback control policy is proved to be of the threshold type, regardless of whether the controller's state space is discretized or continuous. Under the threshold-type policy, feedback is performed whenever a state variable indicating the accuracy of the transmit CSI falls below a threshold that varies with the channel power. The practical finite-rate feedback channel is also considered, and the optimal policy for quantized feedback is proved to be of the threshold type as well. The effect of CSI quantization is shown to be equivalent to an increment of the feedback price, and this increment is upper bounded by the expected logarithm of one minus the quantization error. Finally, simulation shows that feedback control increases the net throughput of conventional periodic feedback by up to 0.5 bit/s/Hz without requiring additional bandwidth or antennas.
    Comment: 29 pages; submitted for publication
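
    A toy illustration of the threshold structure: with the controller state reduced to a scalar CSI-accuracy variable, a short dynamic program over a discretized grid already yields an on/off rule of the threshold type. The deterministic decay model, constants, and grid below are assumptions made for illustration only, not the paper's channel model.

    import numpy as np

    GRID = np.linspace(0.0, 1.0, 101)   # discretized CSI-accuracy states
    RHO, SNR = 0.9, 10.0                # per-slot accuracy decay, mean SNR (assumed)
    PRICE, GAMMA = 0.4, 0.95            # feedback price (bit/s/Hz), discount

    def rate(d):                        # throughput at accuracy d
        return np.log2(1.0 + SNR * d)

    decay = np.searchsorted(GRID, RHO * GRID)   # next-state index if silent

    V = np.zeros_like(GRID)
    for _ in range(500):                        # dynamic programming sweep
        v_skip = rate(GRID) + GAMMA * V[decay]            # no feedback sent
        v_fb = rate(1.0) - PRICE + GAMMA * V[decay[-1]]   # feed back, reset to 1
        V = np.maximum(v_skip, v_fb)

    feedback = v_fb > v_skip    # the optimal rule: feed back iff accuracy
    threshold = GRID[feedback].max() if feedback.any() else None   # below this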

    Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning

    Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyper-parameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning and perform an in-depth qualitative and quantitative review of the method. We show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. We complement these guarantees with empirical evidence attesting to the strong positive effect that consistent satisfaction of the Lipschitzness constraint on the reward has on imitation performance. Finally, we tackle a generic pessimistic reward preconditioning add-on spawning a large class of reward shaping methods, which makes the base method it is plugged into provably more robust, as shown in several additional theoretical guarantees. We then discuss these through a fine-grained lens and share our insights. Crucially, the guarantees derived and reported in this work are valid for any reward satisfying the Lipschitzness condition; nothing is specific to imitation. As such, they may be of independent interest.
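
    One common way to encourage local Lipschitzness of a learned reward (or discriminator) is a gradient penalty evaluated on interpolates between expert and policy samples. The PyTorch sketch below shows this generic recipe, not the paper's exact formulation; the Lipschitz target k and the one-sided clamp are standard choices assumed here.

    import torch

    def lipschitz_penalty(reward_net, expert_sa, policy_sa, k=1.0):
        # Penalize reward gradients whose norm exceeds k on random
        # interpolates between expert and policy state-action pairs.
        eps = torch.rand(expert_sa.size(0), 1)
        mix = (eps * expert_sa + (1.0 - eps) * policy_sa).requires_grad_(True)
        out = reward_net(mix).sum()
        grad, = torch.autograd.grad(out, mix, create_graph=True)
        excess = (grad.norm(2, dim=1) - k).clamp(min=0.0)   # one-sided
        return (excess ** 2).mean()

    # usage (hypothetical): loss = gail_loss + lam * lipschitz_penalty(net, e, p)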

    Inverse Reinforcement Learning in Large State Spaces via Function Approximation

    This paper introduces a new method for inverse reinforcement learning in large-scale, high-dimensional state spaces. To avoid solving computationally expensive reinforcement learning problems during reward learning, we propose a function approximation method that ensures the Bellman Optimality Equation always holds, and then estimate a function that maximizes the likelihood of the observed motion. The time complexity of the proposed method is linear in the cardinality of the action set, so it can handle large state spaces efficiently. We test the proposed method in a simulated environment and show that it is more accurate than existing methods and scales significantly better. We also show that the proposed method can extend many existing methods to high-dimensional state spaces. We then apply the method to evaluating the effect of rehabilitative stimulations on patients with spinal cord injuries, based on observed patient motions.
    Comment: Experiment update
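
    One way to keep the Bellman Optimality Equation satisfied by construction is to parameterize Q directly and define the reward from it, so reward learning reduces to a likelihood fit over the demonstrated actions; each state then costs one forward pass, hence time linear in the action set. The sketch below is a hypothetical rendering of that idea; the layer sizes, softmax policy model, and hyper-parameters are assumptions, not the paper's architecture.

    import torch

    STATE_DIM, N_ACTIONS = 8, 4          # toy sizes (assumptions)
    q_net = torch.nn.Sequential(
        torch.nn.Linear(STATE_DIM, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, N_ACTIONS))
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def train_step(states, actions):
        # Maximize the likelihood of demonstrated actions under a
        # softmax (Boltzmann) policy over Q -- one forward pass per
        # state, so the cost is linear in the number of actions.
        logits = q_net(states)           # (batch, N_ACTIONS)
        loss = torch.nn.functional.cross_entropy(logits, actions)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    def recovered_reward(s, a, s_next, gamma=0.99):
        # Bellman optimality holds by definition of the reward:
        #   r(s, a) = Q(s, a) - gamma * max_a' Q(s', a')
        with torch.no_grad():
            return (q_net(s)[a] - gamma * q_net(s_next).max()).item()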

    Fuzzifying [sic] Markov decision process

    Markov decision processes have become an indispensable tool in applications as diverse as equipment maintenance, manufacturing systems, inventory control, queueing networks, and investment analysis. Typically we have a controlled Markov chain on a suitable state space in which the transition probabilities depend on the action chosen by the decision maker from a set of possible actions. The main problem of interest is to find an optimal policy that minimizes the associated cost. Linear programming has been widely used to find the optimal Markov decision policy, but it requires solving large systems of simultaneous linear equations, and because its complexity grows rapidly with the number of states (the so-called curse of dimensionality), it can handle only small models. This thesis presents a new method to lessen the curse of dimensionality. Assuming a certain monotonicity property of the transition probabilities, it is shown that fuzzy membership functions can be used to reduce the number of states. Although states are eliminated from the reduced model, no information is lost: the eliminated states can be recovered through interpolation with the aid of the membership functions. The proposed method is shown to be effective in coping with the curse of dimensionality.
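
    To make the interpolation idea concrete: keep a coarse set of anchor states, give every original state a membership vector over the anchors, and recover values of eliminated states as membership-weighted combinations. The triangular memberships and one-dimensional state space below are illustrative assumptions, not the thesis's construction.

    import numpy as np

    states = np.linspace(0, 100, 101)    # original state space
    anchors = np.linspace(0, 100, 11)    # retained (anchor) states

    def membership(s):
        # Triangular membership of state s in each anchor; the weights
        # are normalized so they sum to 1 (anchor spacing is 10).
        mu = np.maximum(0.0, 1.0 - np.abs(s - anchors) / 10.0)
        return mu / mu.sum()

    # given a value function solved on the small anchor model ...
    V_anchor = np.random.rand(len(anchors))   # placeholder values

    # ... every eliminated state is recovered through interpolation:
    V_full = np.array([membership(s) @ V_anchor for s in states])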

    Sensor Path Planning for Emitter Localization

    The localization of a radio frequency (RF) emitter is relevant in many military and civilian applications. The recent decade has seen rapid progress in the development of small and mobile unmanned aerial vehicles (UAVs), which offer a way to perform emitter localization autonomously. The path a UAV travels significantly influences the localization, making path planning an important part of a mobile emitter localization system. The topic of this thesis is path planning for a UAV that uses bearing measurements to localize a stationary emitter. Using a directional antenna, the UAV determines the direction towards the emitter by rotating around its own vertical axis. During this rotation the UAV must remain at the same position, which induces a trade-off between movement and measurement that shapes the optimal trajectories. This thesis derives a novel path planning algorithm for localizing an emitter with a UAV. In simulations it improves on the current state of the art by providing a localization of defined accuracy in less time than other algorithms. The algorithm uses the policy rollout principle to plan non-myopically and to incorporate the uncertainty of the estimation process into its decisions. The concept of an action selection algorithm for policy rollout is introduced, which allows existing optimization algorithms to be used to search the action space effectively; multiple action selection algorithms are compared to optimize the speed of the path planning algorithm. Similarly, to reduce the computational demand, an adaptive grid-based localizer has been developed. To evaluate the algorithm, an experimental system was built and the algorithm was tested on it. Based on initial experiments, the path planning algorithm was modified to include a minimum distance to the emitter and an outlier detection step. The resulting algorithm shows promising results in experimental flights.
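
    A minimal sketch of the policy rollout principle described above: the value of each candidate action is estimated by Monte Carlo simulation of a base policy from the resulting belief, and the candidate loop can be replaced by any action selection (optimization) algorithm over the action space. The helper functions simulate and base_policy are assumed interfaces, not the thesis's implementation.

    import numpy as np

    def rollout_value(belief, action, base_policy, simulate, depth=10, n_sim=20):
        # Estimate Q(belief, action) by simulating the first action and
        # then following the base policy for the remaining steps.
        total = 0.0
        for _ in range(n_sim):
            b, reward = simulate(belief, action)       # one noisy step
            for _ in range(depth - 1):
                b, r = simulate(b, base_policy(b))     # follow base policy
                reward += r
            total += reward
        return total / n_sim

    def select_action(belief, candidates, base_policy, simulate):
        # Non-myopic choice: pick the candidate with the best rollout value.
        values = [rollout_value(belief, a, base_policy, simulate)
                  for a in candidates]
        return candidates[int(np.argmax(values))]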