Deep Ordinal Reinforcement Learning
Reinforcement learning usually makes use of numerical rewards, which have
nice properties but also come with drawbacks and difficulties. Using rewards on
an ordinal scale (ordinal rewards) is an alternative to numerical rewards that
has received more attention in recent years. In this paper, a general approach
to adapting reinforcement learning problems to the use of ordinal rewards is
presented and motivated. We show how to convert common reinforcement learning
algorithms to an ordinal variation by the example of Q-learning and introduce
Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal
rewards. Additionally, we run evaluations on problems provided by the OpenAI
Gym framework, showing that our ordinal variants exhibit a performance that is
comparable to the numerical variations for a number of problems. We also give
first evidence that our ordinal variant is able to produce better results for
problems with less engineered and simpler-to-design reward signals.
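The ordinal idea above can be sketched without the deep-learning machinery: keep, for each state-action pair, an empirical distribution over ordinal reward ranks instead of a scalar Q-value, and compare actions by the probability that one draws a better rank than another. The sketch below is a minimal illustration under these assumptions, not the authors' Ordinal Deep Q-Network; the `superiority` measure and the Laplace smoothing are illustrative choices.

```python
import numpy as np

def ordinal_update(counts, state, action, rank):
    """Record one observed ordinal reward rank for (state, action)."""
    counts[state, action, rank] += 1

def superiority(counts, state, a, b):
    """P(action a draws a strictly better rank than b) + 0.5 * P(tie),
    computed from the empirical rank distributions of both actions."""
    pa = counts[state, a] / counts[state, a].sum()
    pb = counts[state, b] / counts[state, b].sum()
    win = sum(pa[i] * pb[:i].sum() for i in range(len(pa)))
    tie = float(pa @ pb)
    return win + 0.5 * tie

# toy setup: 1 state, 2 actions, 3 ordinal ranks (0 = worst)
counts = np.ones((1, 2, 3))          # Laplace-smoothed counts
for r in [2, 2, 1]:                  # action 0 mostly draws high ranks
    ordinal_update(counts, 0, 0, r)
for r in [0, 0, 1]:                  # action 1 mostly draws low ranks
    ordinal_update(counts, 0, 1, r)
print(superiority(counts, 0, 0, 1) > 0.5)  # → True
```

A greedy ordinal policy would then pick the action with the highest average superiority over the alternatives, mirroring how a numerical agent picks the argmax Q-value.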
Water resources management in a homogenizing world: Averting the Growth and Underinvestment trajectory
Biotic homogenization, a de facto symptom of a global biodiversity crisis, underscores the urgency of reforming water resources management to focus on the health and viability of ecosystems. Global population and economic growth, coupled with inadequate investment in maintenance of ecological systems, threaten to degrade environmental integrity and ecosystem services that support the global socioeconomic system, indicative of a system governed by the Growth and Underinvestment (G&U) archetype. Water resources management is linked to biotic homogenization and degradation of system integrity through alteration of water systems, ecosystem dynamics, and composition of the biota. Consistent with the G&U archetype, water resources planning primarily treats ecological considerations as exogenous constraints rather than integral, dynamic, and responsive parts of the system. It is essential that the ecological considerations be made objectives of water resources development plans to facilitate the analysis of feedbacks and potential trade-offs between socioeconomic gains and ecological losses. We call for expediting a shift to ecosystem-based management of water resources, which requires a better understanding of the dynamics and links between water resources management actions, ecological side-effects, and associated long-term ramifications for sustainability. To address existing knowledge gaps, models that include dynamics and estimated thresholds for regime shifts or ecosystem degradation need to be developed. Policy levers for implementation of ecosystem-based water resources management include shifting away from growth-oriented supply management, better demand management, increased public awareness, and institutional reform that promotes adaptive and transdisciplinary management approaches
Learning Best Response Strategies for Agents in Ad Exchanges
Ad exchanges are widely used in platforms for online display advertising.
Autonomous agents operating in these exchanges must learn policies for
interacting profitably with a diverse, continually changing, but unknown
market. We consider this problem from the perspective of a publisher,
strategically interacting with an advertiser through a posted price mechanism.
The learning problem for this agent is made difficult by the fact that
information is censored, i.e., the publisher knows if an impression is sold but
no other quantitative information. We address this problem using the
Harsanyi-Bellman Ad Hoc Coordination (HBA) algorithm, which conceptualises this
interaction in terms of a Stochastic Bayesian Game and arrives at optimal
actions by best responding with respect to probabilistic beliefs maintained
over a candidate set of opponent behaviour profiles. We adapt and apply HBA to
the censored information setting of ad exchanges. Also, addressing the case of
stochastic opponents, we devise a strategy based on a Kaplan-Meier estimator
for opponent modelling. We evaluate the proposed method using simulations
wherein we show that HBA-KM achieves a substantially better competitive ratio and
lower variance of return than baselines, including a Q-learning agent and a
UCB-based online learning agent, and performs comparably to the offline optimal
algorithm.
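The censored-information setting lends itself to a compact sketch: a sale at posted price p only reveals that the buyer's valuation was at least p, so sales can be encoded as right-censored observations and non-sales as events in a Kaplan-Meier product-limit estimator of the valuation's survival function. The hand-rolled estimator below is a minimal sketch under that encoding, not the paper's HBA-KM agent; the data and variable names are illustrative.

```python
def kaplan_meier(observations):
    """Product-limit estimate of S(p) = P(buyer valuation > p) from
    censored posted-price data.

    observations: list of (price, sold) pairs. A sale at price p only tells
    us valuation >= p (right-censored); a non-sale is treated as an 'event'
    at p. Returns the (price, survival) steps where the estimate drops.
    """
    obs = sorted(observations)
    n = len(obs)
    steps, s = [], 1.0
    i = 0
    while i < n:
        p = obs[i][0]
        at_risk = n - i            # observations with price >= p
        events = 0
        while i < n and obs[i][0] == p:
            if not obs[i][1]:      # non-sale: valuation below p
                events += 1
            i += 1
        if events:
            s *= 1.0 - events / at_risk
            steps.append((p, s))
    return steps

# toy data: sales at low posted prices, non-sales at high ones
data = [(1, True), (1, True), (2, True), (3, False), (3, False), (4, False)]
steps = kaplan_meier(data)
print(len(steps))  # → 2 (the estimate drops at prices 3 and 4)
```

A publisher could then set the posted price by maximising the expected revenue p * S(p) over the estimated curve.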
Performance Enhancement of Deep Reinforcement Learning Networks using Feature Extraction
The combination of Deep Learning and Reinforcement Learning, termed Deep Reinforcement Learning Networks (DRLN), offers the possibility of using a Deep Learning Neural Network to produce an approximate Reinforcement Learning value table that allows extraction of features from neurons in the hidden layers of the network. This paper presents a two-stage technique for training a DRLN on features extracted from a DRLN trained on an identical problem, via the implementation of the Q-Learning algorithm, using TensorFlow. The results show that the extraction of features from the hidden layers of the Deep Q-Network improves the learning process of the agent (4.58 times faster and better) and proves the existence of encoded information about the environment which can be used to select the best action. The research contributes preliminary work to an ongoing research project on modeling features extracted from DRLNs.
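The core mechanism described above is to read out a trained network's hidden-layer activations and reuse them as inputs to a second network. The sketch below illustrates that extraction step only, with randomly initialised weights standing in for a trained Deep Q-Network; the architecture (one ReLU hidden layer, a linear Q head) and all names are assumptions, not the paper's TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for a *trained* Deep Q-Network: state dim 4, 16 hidden units, 2 actions
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)

def q_values(state):
    hidden = np.maximum(0.0, state @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2                    # linear Q head

def extract_features(state):
    """Hidden-layer activations, reused as the input representation
    for a second network trained on the same problem."""
    return np.maximum(0.0, state @ W1 + b1)

state = rng.normal(size=4)
features = extract_features(state)
print(features.shape)  # → (16,)
# a second DRLN would now be trained on `features` rather than the raw state
```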
Pseudorehearsal in value function approximation
Catastrophic forgetting is of special importance in reinforcement learning,
as the data distribution is generally non-stationary over time. We study and
compare several pseudorehearsal approaches for Q-learning with function
approximation in a pole balancing task. We have found that pseudorehearsal
seems to assist learning even in such very simple problems, given proper
initialization of the rehearsal parameters.
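Pseudorehearsal works by sampling random pseudo-inputs, labelling them with the network's own current outputs, and mixing these pseudo-items into later updates so the old input-output mapping keeps being rehearsed. The sketch below shows the pseudo-item generation step under stated assumptions (a linear map stands in for the Q-function approximator; the sampling range and names are illustrative), not the specific setup compared in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_pseudo_items(predict, n_items, state_dim):
    """Sample random pseudo-states and label them with the network's own
    current outputs, so later updates can rehearse the old mapping."""
    pseudo_states = rng.uniform(-1.0, 1.0, size=(n_items, state_dim))
    pseudo_targets = predict(pseudo_states)
    return pseudo_states, pseudo_targets

# stand-in for a Q-function approximator: a fixed linear map
W = rng.normal(size=(4, 2))
predict = lambda s: s @ W

ps, pt = make_pseudo_items(predict, n_items=32, state_dim=4)

# a Q-learning update on new experience would then minimise
#   loss(new_batch) + loss on (ps, pt)
# so the network keeps reproducing its old outputs on the pseudo-items
print(np.allclose(predict(ps), pt))  # → True
```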
Identifying Critical States by the Action-Based Variance of Expected Return
The balance of exploration and exploitation plays a crucial role in
accelerating reinforcement learning (RL). To deploy an RL agent in human
society, its explainability is also essential. However, basic RL approaches
have difficulties in deciding when to choose exploitation as well as in
extracting useful points for a brief explanation of its operation. One reason
for the difficulties is that these approaches treat all states the same way.
Here, we show that identifying critical states and treating them specially is
commonly beneficial to both problems. These critical states are the states at
which the action selection changes the potential of success and failure
substantially. We propose to identify the critical states using the variance in
the Q-function for the actions and to perform exploitation with high
probability on the identified states. These simple methods accelerate RL in a
grid world with cliffs and two baseline tasks of deep RL. Our results also
demonstrate that the identified critical states are intuitively interpretable
regarding the crucial nature of the action selection. Furthermore, our analysis
of the relationship between the timing of the identification of especially
critical states and the rapid progress of learning suggests there are a few
especially critical states that have important information for accelerating RL
rapidly.
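The proposed action-selection rule is simple enough to sketch directly: a state is flagged as critical when the variance of its Q-values across actions is large, and the agent then exploits with high probability instead of exploring. The snippet below is a minimal sketch of that rule under assumed names and a hand-picked threshold, not the paper's full method (which it evaluates with deep RL baselines).

```python
import numpy as np

def choose_action(q_row, threshold, epsilon, rng):
    """Exploit greedily at 'critical' states, where the variance of the
    Q-values across actions is large; otherwise act epsilon-greedily."""
    if np.var(q_row) > threshold:
        return int(np.argmax(q_row))          # critical state: exploit
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # ordinary state: may explore
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)
critical = np.array([5.0, -5.0, 0.0])  # action choice matters a lot here
ordinary = np.array([1.0, 1.1, 0.9])   # actions are nearly equivalent
print(choose_action(critical, threshold=1.0, epsilon=0.1, rng=rng))  # → 0
```

The flagged states double as explanation points, since they are exactly the states where the action choice swings the expected return most.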
Adherence and persistence to direct oral anticoagulants in atrial fibrillation: a population-based study
Background Despite simpler regimens than vitamin K antagonists (VKAs) for stroke prevention in atrial fibrillation (AF), adherence (taking drugs as prescribed) and persistence (continuation of drugs) to direct oral anticoagulants are suboptimal, yet understudied in electronic health records (EHRs).
Objective We investigated (1) time trends at individual and system levels, and (2) the risk factors for and associations between adherence and persistence.
Methods In UK primary care EHR (The Health Information Network 2011–2016), we investigated adherence and persistence at 1 year for oral anticoagulants (OACs) in adults with incident AF. Baseline characteristics were analysed by OAC and adherence/persistence status. Risk factors for non-adherence and non-persistence were assessed using Cox and logistic regression. Patterns of adherence and persistence were analysed.
Results Among 36 652 individuals with incident AF, cardiovascular comorbidities (median CHA2DS2VASc[Congestive heart failure, Hypertension, Age≥75 years, Diabetes mellitus, Stroke, Vascular disease, Age 65-74 years, Sex category] 3) and polypharmacy (median number of drugs 6) were common. Adherence was 55.2% (95% CI 54.6 to 55.7), 51.2% (95% CI 50.6 to 51.8), 66.5% (95% CI 63.7 to 69.2), 63.1% (95% CI 61.8 to 64.4) and 64.7% (95% CI 63.2 to 66.1) for all OACs, VKA, dabigatran, rivaroxaban and apixaban. One-year persistence was 65.9% (95% CI 65.4 to 66.5), 63.4% (95% CI 62.8 to 64.0), 61.4% (95% CI 58.3 to 64.2), 72.3% (95% CI 70.9 to 73.7) and 78.7% (95% CI 77.1 to 80.1) for all OACs, VKA, dabigatran, rivaroxaban and apixaban. Risk of non-adherence and non-persistence increased over time at individual and system levels. Increasing comorbidity was associated with reduced risk of non-adherence and non-persistence across all OACs. Overall rates of ‘primary non-adherence’ (stopping after first prescription), ‘non-adherent non-persistence’ and ‘persistent adherence’ were 3.5%, 26.5% and 40.2%, differing across OACs.
Conclusions Adherence and persistence to OACs are low at 1 year with heterogeneity across drugs and over time at individual and system levels. Better understanding of contributory factors will inform interventions to improve adherence and persistence across OACs in individuals and populations.
Learning from Monte Carlo Rollouts with Opponent Models for Playing Tron
This paper describes a novel reinforcement learning system for learning to play the game of Tron. The system combines Q-learning, multi-layer perceptrons, vision grids, opponent modelling, and Monte Carlo rollouts in a novel way. By learning an opponent model, Monte Carlo rollouts can be effectively applied to generate state trajectories for all possible actions, from which improved action estimates can be computed. This makes it possible to extend experience replay by updating the state-action values of all actions in a given game state simultaneously. The results show that this form of experience replay, which updates the Q-values of all actions simultaneously, strongly outperforms conventional experience replay, which only updates the Q-value of the performed action. The results also show that using short or long rollout horizons during training leads to similarly good performance against two fixed opponents.
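The key mechanism above is that an opponent model lets the agent roll out *every* action from a state, so the replayed update can refresh all Q-values at once rather than only the action actually played. The sketch below illustrates that update under stated assumptions: a toy stand-in environment, a random stand-in opponent model, and a rollout policy that simply holds the chosen action fixed; none of this is the paper's Tron setup.

```python
import random

ACTIONS = [0, 1, 2, 3]  # e.g. the four movement directions in Tron

def rollout_value(state, action, step, opponent_model, horizon):
    """Estimate the return of taking `action` in `state` by simulating
    `horizon` steps, with opponent moves drawn from the learned model.
    The action is held fixed during the short rollout (a simplification
    of the rollout policy)."""
    total = 0.0
    for _ in range(horizon):
        state, reward, done = step(state, action, opponent_model(state))
        total += reward
        if done:
            break
    return total

def update_all_actions(q, state, step, opponent_model, horizon=5, alpha=0.1):
    """Extended experience replay: refresh the Q-value of *every* action
    available in `state`, not only the action that was actually played."""
    for a in ACTIONS:
        target = rollout_value(state, a, step, opponent_model, horizon)
        q[(state, a)] = (1 - alpha) * q.get((state, a), 0.0) + alpha * target

# toy stand-in environment: action 0 earns reward 1, others 0; no terminal state
step = lambda s, a, opp_a: (s, 1.0 if a == 0 else 0.0, False)
opponent_model = lambda s: random.choice(ACTIONS)  # stand-in learned model

q = {}
update_all_actions(q, state="s0", step=step, opponent_model=opponent_model)
print(q[("s0", 0)] > q[("s0", 1)])  # → True
```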
Novel insights into diminished cardiac reserve in non-obstructive hypertrophic cardiomyopathy from four-dimensional flow cardiac magnetic resonance component analysis
Aims: Hypertrophic cardiomyopathy (HCM) is characterized by hypercontractility and diastolic dysfunction, which alter blood flow haemodynamics and are linked with increased risk of adverse clinical events. Four-dimensional flow cardiac magnetic resonance (4D-flow CMR) enables comprehensive characterization of ventricular blood flow patterns. We characterized flow component changes in non-obstructive HCM and assessed their relationship with phenotypic severity and sudden cardiac death (SCD) risk.
Methods and results: Fifty-one participants (37 non-obstructive HCM and 14 matched controls) underwent 4D-flow CMR. Left-ventricular (LV) end-diastolic volume was separated into four components: direct flow (blood transiting the ventricle within one cycle), retained inflow (blood entering the ventricle and retained for one cycle), delayed ejection flow (retained ventricular blood ejected during systole), and residual volume (ventricular blood retained for >two cycles). Flow component distribution and component end-diastolic kinetic energy/mL were estimated. HCM patients demonstrated greater direct flow proportions compared with controls (47.9 ± 9% vs. 39.4 ± 6%, P = 0.002), with reduction in other components. Direct flow proportions correlated with LV mass index (r = 0.40, P = 0.004), end-diastolic volume index (r = −0.40, P = 0.017), and SCD risk (r = 0.34, P = 0.039). In contrast to controls, in HCM, stroke volume decreased with increasing direct flow proportions, indicating diminished volumetric reserve. There was no difference in component end-diastolic kinetic energy/mL.
Conclusion: Non-obstructive HCM possesses a distinctive flow component distribution pattern characterised by greater direct flow proportions, and direct flow-stroke volume uncoupling indicative of diminished cardiac reserve. The correlation of direct flow proportion with phenotypic severity and SCD risk highlights its potential as a novel and sensitive haemodynamic measure of cardiovascular risk in HCM.