70 research outputs found

    Efficient Rate-Constrained Nash Equilibrium in Collision Channels with State Information

    Get PDF

    Basis Expansion in Natural Actor Critic Methods

    Get PDF
    In reinforcement learning, the aim of the agent is to find a policy that maximizes its expected return. Policy gradient methods try to accomplish this goal by directly approximating the policy using a parametric function approximator; the expected return of the current policy is estimated and its parameters are updated by steepest ascent in the direction of the gradient of the expected return with respect to the policy parameters. In general, the policy is defined in terms of a set of basis functions that capture important features of the problem. Since the quality of the resulting policies depends directly on the set of basis functions, and defining them becomes harder as the complexity of the problem increases, it is important to be able to find them automatically. In this paper, we propose a new approach which uses the cascade-correlation learning architecture for automatically constructing a set of basis functions within the context of Natural Actor-Critic (NAC) algorithms. Such basis functions allow more complex policies to be represented, and consequently improve the performance of the resulting policies. We also demonstrate the effectiveness of the method empirically.
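    As a rough illustration of the policy-gradient setup this abstract describes (and not the paper's cascade-correlation construction or its natural-gradient update), the sketch below parameterizes a softmax policy over a fixed, hand-chosen polynomial basis and updates it by steepest ascent on the sampled return. The toy chain environment, the basis, and all names are assumptions made only to keep the example self-contained.

```python
# Minimal sketch (not the paper's implementation): a softmax policy built on a
# hand-chosen set of basis functions, updated with a vanilla policy-gradient
# (REINFORCE-style) step. The paper's contribution is to *grow* such basis
# functions automatically with cascade-correlation inside Natural Actor-Critic;
# here the basis is fixed and the 1-D chain environment is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, ACTIONS = 10, (-1, +1)           # toy chain MDP, actions: step left/right
GOAL = N_STATES - 1

def basis(s):
    # simple polynomial basis features of the normalized state index
    x = s / (N_STATES - 1)
    return np.array([1.0, x, x**2])

theta = np.zeros((len(ACTIONS), 3))        # policy parameters, one row per action

def policy_probs(s):
    prefs = theta @ basis(s)               # action preferences = theta_a . phi(s)
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def run_episode():
    s, traj = 0, []
    for _ in range(100):
        p = policy_probs(s)
        a = rng.choice(len(ACTIONS), p=p)
        s2 = int(np.clip(s + ACTIONS[a], 0, GOAL))
        r = 1.0 if s2 == GOAL else -0.01
        traj.append((s, a, r))
        s = s2
        if s == GOAL:
            break
    return traj

alpha = 0.1
for episode in range(500):
    traj = run_episode()
    G = sum(r for _, _, r in traj)         # undiscounted return of the episode
    for s, a, _ in traj:
        p = policy_probs(s)
        # grad log pi(a|s) for a softmax policy over basis features
        grad = -np.outer(p, basis(s))
        grad[a] += basis(s)
        theta += alpha * G * grad          # steepest-ascent update on expected return

print("learned action probabilities at state 0:", policy_probs(0))
```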

    Tackling bovine TB

    Get PDF
    On 18 December Defra revealed that during 2018, 32,601 badgers were killed, bringing the total number slaughtered under licence since 2013 to almost 67,000.1 'Effectiveness' claims relate not to the impact on cattle TB, but rather to the ability of the contracted shooters to kill sufficient badgers to satisfy their licence requirements, which they can hardly fail to reach given that target numbers are 'adjusted' by Natural England part-way through the culls according to the contractors' progress.

    Sixty per cent of the badgers have been killed by 'controlled shooting', a method rejected by both the government's Independent Expert Panel2 and the BVA3 because of animal welfare concerns. During 2018 Natural England directly monitored just 89 (0.43 per cent) of controlled shooting events. It is deplorable that the chief veterinary officer (CVO) continues to support the roll-out of a policy that permits controlled shooting, when veterinary organisations have condemned the method on animal welfare grounds. It is particularly concerning that the CVO appears to have deflected responsibility for humaneness to Natural England's chief scientist who, as far as we are aware, has no background in animal welfare science.

    It is also unacceptable for government to attribute reductions in herd bovine TB (bTB) incidents over the first four years of culling in the original 'pilot' cull zones to its badger culling policy.4 Independent analysis of this and more recent data from the Gloucestershire pilot cull zone5 indicates that new herd incidence is rising sharply, with 22 herd breakdowns in the 12 months to September 2017 (an increase of 29.4 per cent when compared to the 17 breakdowns reported by APHA for the previous 12 months). Analysis of the 2018 figures indicates that both incidence and prevalence are now rising even faster, with a further 24 herd breakdowns occurring between 1 January and 5 December 2018.

    To date, the government and its officials cite data that are two years out of date, but have declined to comment on this emerging evidence that, far from resulting in a substantial disease control benefit, badger culls may be leading to a sharp increase in bTB in cattle. Natural England's chief scientist and the UK's CVO continue to endorse a failing and inhumane policy, bringing their offices into serious disrepute. We urge them, and the BVA, to reconsider their support for further badger culling, and instead focus on the actual cause of bTB's epidemic spread – a cattle skin test with a sensitivity of only 50 per cent,6,7 and the ongoing problems associated with cattle movements and on-farm biosecurity.

    Animal welfare impacts of badger culling operations

    Get PDF
    We are writing to express our extreme concern following recent media coverage1,2 relating to the methodology being used by contractors to kill badgers under licence, as part of the government's policy to control bovine TB in cattle. The coverage relates to the shooting of badgers that have been captured in live traps.

    Covert video footage (https://bit.ly/2Eud1iR) from Cumbria shows a trapped badger being shot with a firearm at close range, following which it appears to take close to a minute to stop moving. The contractor clearly observes the animal during this time but makes no attempt to expedite the death of the badger and prevent further suffering, as required by the current Natural England best practice guide, which states: 'Immediately after shooting, the animal should be checked to ensure it is dead, and if there is any doubt, a second shot must be taken as soon as possible.'3 The conversation between the contractor and his companion also suggests they were considering moving the badger to another site before finally bagging the carcase, again breaching the best practice guide.

    While the footage only relates to the experience of a single badger, and while the degree to which the badger was conscious in the period immediately following the shot is unclear, we can by no means be certain that the badger did not suffer. It also raises serious questions about the training, competence and behaviour of contractors, in relation to both badger welfare and biosecurity.

    This adds to existing concerns relating to the humaneness of 'controlled shooting' (targeting free-roaming badgers with rifles), which continues to be a permitted method under culling licences, in spite of the reservations expressed by both the government-commissioned Independent Expert Panel in its 2014 report4 and the BVA, which concluded in 2015 that it 'can no longer support the continued use of controlled shooting as part of the badger control policy'.5 (However, it has since continued to support the issuing of licences which permit the method.) The BVA has consistently indicated its support for what it calls the 'tried and tested' method of trapping and shooting, but has thus far failed to provide comprehensive and robust evidence for the humaneness of this method.

    During 2017, almost 20,000 badgers were killed under licence across 19 cull zones, around 60 per cent of which were killed by controlled shooting, the remainder being trapped and shot.6 Natural England reported that its monitors observed 74 (just over 0.6 per cent) of controlled shooting events for accuracy and humaneness. No information has been provided on the extent to which trapping and shooting activities were monitored. This raises serious concerns about the extent of suffering that might be experienced by very large numbers of animals, for which contractors are not being held to account.

    If contractors reach the maximum culling targets set by Natural England for 2018, as many as 41,000 additional badgers could be killed.7 The extent to which these animals will suffer is once again being left in the hands of contractors, with woefully inadequate oversight, and in the face of anecdotal evidence of breaches of best practice guidance.

    This situation is clearly unacceptable from an animal welfare perspective and it is our view that, by endorsing the policy, the BVA is contradicting the principles contained within its own animal welfare strategy.8 We therefore urge the BVA to withdraw its support for any further licensed badger culling, and the RCVS to make it clear that any veterinarian who provides support for culling activities that result in unnecessary and avoidable animal suffering could face disciplinary proceedings.

    The veterinary profession has no business supporting this licensed mass killing, with all its inherent negative welfare and biosecurity implications, and for which the disease control benefits are, at best, extremely uncertain. We believe the continued support for the culls by veterinary bodies in the face of poor evidence for their efficacy damages the credibility of the profession, and that same support in the face of potential animal suffering on a large scale undermines its reputation. We stand ready to discuss these issues in more detail.

    Approximate policy iteration: A survey and some new methods

    Get PDF
    We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.

    National Science Foundation (U.S.) (No. ECCS-0801549); Los Alamos National Laboratory, Information Science and Technology Institute; United States Air Force (No. FA9550-10-1-0412).
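    As a hedged illustration of the matrix-inversion style of policy evaluation the survey discusses (LSTD with a regularization term), the sketch below estimates the value function of a fixed policy from a single simulated trajectory of a made-up chain MDP; the environment, feature map, and constants are assumptions, not anything taken from the paper.

```python
# Minimal sketch (not the survey's code): least-squares temporal difference
# (LSTD) policy evaluation from one trajectory, with an optional ridge term of
# the kind the survey discusses for nearly singular projected Bellman equations.
import numpy as np

rng = np.random.default_rng(1)
N, GAMMA, REG = 5, 0.9, 1e-3               # states, discount, ridge coefficient

def features(s):
    # two simple polynomial features of the normalized state index
    x = s / (N - 1)
    return np.array([1.0, x])

def step(s):
    # fixed policy on a toy chain: move right with prob 0.7, left with prob 0.3
    s2 = min(s + 1, N - 1) if rng.random() < 0.7 else max(s - 1, 0)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r

# accumulate the LSTD statistics A ~ sum phi(s)(phi(s) - gamma phi(s'))^T, b ~ sum phi(s) r
k = len(features(0))
A, b = np.zeros((k, k)), np.zeros(k)
s = 0
for _ in range(5000):
    s2, r = step(s)
    phi, phi2 = features(s), features(s2)
    A += np.outer(phi, phi - GAMMA * phi2)
    b += phi * r
    s = s2

# matrix-inversion solution; the ridge term keeps A well conditioned
w = np.linalg.solve(A + REG * np.eye(k), b)
print("approximate state values:", [features(i) @ w for i in range(N)])
```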

    Jamming Game in a Dynamic Slotted ALOHA Network

    No full text
    In this paper we suggest a development of the channel capacity concept for a dynamic slotted ALOHA network. Our objective is to find the maxmin number of successful transmissions of information over a dynamic communication channel. To do so, we analyze the performance of an ALOHA-type medium access control protocol in the presence of a jammer. Time is slotted and the system is described as a zero-sum multistage matrix game. The two players, the sender and the jammer, have different costs for sending packets and jamming, respectively, and the jammer wants to minimize the payoff function of the sender. For this case, we find an explicit expression for the equilibrium strategies as a function of the costs of sending packets and jamming. Properties of the equilibrium are investigated. In particular, we find a simple linear relation between the two players' probabilities of acting in the different channel states, which is independent of the number of packets left to send. This relation implies that increased activity by the jammer leads to reduced activity by the sender in each channel state. The results are generalized to the case where the channel can be in different states and changes according to a Markov rule. Numerical illustrations are provided.
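    As a hedged illustration of the kind of zero-sum matrix game analyzed per stage (not the paper's model or its closed-form equilibrium), the sketch below computes the value and the sender's mixed equilibrium strategy of a small "transmit vs. jam" payoff matrix by linear programming; the payoff entries, which fold in transmission and jamming costs, are invented for the example.

```python
# Minimal sketch (not the paper's solution): solving a one-shot zero-sum
# sender-vs-jammer matrix game by linear programming. The payoff matrix is a
# made-up example: entries are the sender's expected net reward under each
# combination of actions (the sender gains when the jammer wastes energy
# jamming an idle slot, and loses when its transmission is jammed).
import numpy as np
from scipy.optimize import linprog

# rows: sender actions (wait, transmit); columns: jammer actions (idle, jam)
M = np.array([[0.0, 0.2],
              [0.8, -0.5]])

m, n = M.shape
# variables: x_1..x_m (sender's mixed strategy) and v (game value); maximize v
c = np.zeros(m + 1)
c[-1] = -1.0                                   # linprog minimizes, so minimize -v
A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v - sum_i x_i M[i, j] <= 0 for all j
b_ub = np.zeros(n)
A_eq = np.zeros((1, m + 1))
A_eq[0, :m] = 1.0                              # probabilities sum to one
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:m], res.x[-1]
print("sender's equilibrium strategy:", x)
print("game value (sender's guaranteed payoff):", v)
```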

    Unified Inter and Intra Options Learning Using Policy Gradient Methods

    No full text
    Temporally extended actions (or macro-actions) have proven useful for speeding up planning and learning, adding robustness, and building prior knowledge into AI systems. The options framework, as introduced in Sutton, Precup and Singh (1999), provides a natural way to incorporate macro-actions into reinforcement learning. In the subgoals approach, learning is divided into two phases: first learning each option with a prescribed subgoal, and then learning to compose the learned options together. In this paper we offer a unified framework for concurrent inter- and intra-options learning. To that end, we propose a modular parameterization of intra-option policies together with option termination conditions and the option selection policy (inter options), and show that these three decision components may be viewed as a unified policy over an augmented state-action space, to which standard policy gradient algorithms may be applied. We identify the basis functions that apply to each of these decision components, and show that they possess a useful orthogonality property that allows the natural gradient to be computed independently for each component. We further outline the extension of the suggested framework to several levels of options hierarchy, and conclude with a brief illustrative example.
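    As an assumption-laden reading of the construction described above (and not the authors' code), the sketch below folds the three decision components of the options framework into a single probability over an augmented (option, action) choice, with one parameter block per component; the feature map and the tiny problem sizes are invented for illustration.

```python
# Minimal sketch: intra-option policies pi_o(a|s), termination conditions
# beta_o(s), and the policy over options mu(o|s) combined into one "augmented"
# policy over (option, action) pairs, to which a likelihood-ratio policy
# gradient could be applied. All names and sizes are hypothetical.
import numpy as np

N_OPTIONS, N_ACTIONS, N_FEATURES = 2, 3, 4

def phi(s):
    # hypothetical state features
    return np.array([1.0, s, s**2, np.sin(s)])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class AugmentedOptionPolicy:
    """Block-structured parameters, one block per decision component."""
    def __init__(self, rng):
        self.theta_intra = rng.normal(0, 0.1, (N_OPTIONS, N_ACTIONS, N_FEATURES))
        self.theta_term = rng.normal(0, 0.1, (N_OPTIONS, N_FEATURES))
        self.theta_inter = rng.normal(0, 0.1, (N_OPTIONS, N_FEATURES))

    def prob(self, s, prev_option, new_option, action):
        """pi_bar(new_option, action | s, prev_option): either the previous
        option continues, or it terminates and the inter-option policy picks."""
        x = phi(s)
        beta = 1.0 / (1.0 + np.exp(-self.theta_term[prev_option] @ x))  # terminate?
        mu = softmax(self.theta_inter @ x)                              # option choice
        switch = beta * mu[new_option] + (1.0 - beta) * (new_option == prev_option)
        pi_a = softmax(self.theta_intra[new_option] @ x)[action]        # intra-option
        return switch * pi_a

rng = np.random.default_rng(0)
pol = AugmentedOptionPolicy(rng)
# a likelihood-ratio (REINFORCE-style) update would weight the gradient of
# log prob(...) by the sampled return; here we just evaluate the probability.
print(pol.prob(s=0.5, prev_option=0, new_option=1, action=2))
```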

    An Incremental Fast Policy Search Using a Single Sample Path

    No full text
    In this paper, we consider the control problem in a reinforcement learning setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is to find an optimal policy which optimizes the long-run gamma-discounted transition costs, where gamma lies in [0, 1). These works also assume access to a generative model/simulator of the underlying MDP, with the hidden premise that realizations of the system dynamics of the MDP for arbitrary policies, in the form of sample paths, can be obtained with ease from the model. In this paper, we consider a cost function which is the expectation of an approximate value function with respect to the steady-state distribution of the Markov chain induced by the policy, without having access to the generative model. We assume that a single sample path generated using an a priori chosen behaviour policy is made available. In this information-restricted setting, we solve the generalized control problem using the incremental cross-entropy method. The proposed algorithm is shown to converge to the solution which is globally optimal relative to the behaviour policy.
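    As a hedged illustration of the optimization machinery named above (and not the paper's incremental, single-sample-path algorithm), the sketch below runs the basic cross-entropy method for policy search, repeatedly fitting a Gaussian to the elite parameter samples; the quadratic score standing in for the performance objective is purely illustrative.

```python
# Minimal sketch of the cross-entropy method for policy search. In the paper
# the performance estimate would come from the single behaviour-policy sample
# path; here it is a made-up black-box score over policy parameters.
import numpy as np

rng = np.random.default_rng(0)
DIM, POP, ELITE, ITERS = 3, 50, 10, 30

def score(theta):
    # hypothetical performance estimate of the policy parameterized by theta
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - target) ** 2)

mean, std = np.zeros(DIM), np.ones(DIM) * 2.0
for it in range(ITERS):
    samples = rng.normal(mean, std, size=(POP, DIM))          # sample candidate parameters
    scores = np.array([score(th) for th in samples])
    elite = samples[np.argsort(scores)[-ELITE:]]              # keep the top performers
    mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the sampling distribution

print("estimated optimum:", mean)
```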