Managing engineering systems with large state and action spaces through deep reinforcement learning
Decision-making for engineering systems can be efficiently formulated as a
Markov Decision Process (MDP) or a Partially Observable MDP (POMDP). Typical
MDP and POMDP solution procedures utilize offline knowledge about the
environment and provide detailed policies for relatively small systems with
tractable state and action spaces. However, in large multi-component systems
the sizes of these spaces easily explode, as system states and actions scale
exponentially with the number of components, whereas environment dynamics are
difficult to describe in explicit form for the entire system and may only
be accessible through numerical simulators. In this work, to address these
issues, an integrated Deep Reinforcement Learning (DRL) framework is
introduced. The Deep Centralized Multi-agent Actor Critic (DCMAC), an off-policy actor-critic DRL approach, is developed to provide efficient life-cycle
policies for large multi-component systems operating in high-dimensional
spaces. Apart from deep function approximations that parametrize large state
spaces, DCMAC also adopts a factorized representation of the system actions, enabling individualized component- and subsystem-level
decisions, while maintaining a centralized value function for the entire
system. DCMAC compares well against Deep Q-Network (DQN) solutions and exact
policies, where applicable, and outperforms optimized baselines that are based
on time-based, condition-based, and periodic policies.
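A minimal sketch of the factorized-action idea described above, assuming a toy three-component system: per-component action heads share a trunk and a single centralized value output, so the action layer grows linearly rather than exponentially with the number of components. All names and sizes are illustrative, not the authors' implementation.

```python
# Illustrative sketch of a factorized actor with a centralized critic.
# Not the authors' code: layer sizes, names, and the toy system are assumptions.
import torch
import torch.nn as nn

class FactorizedActorCritic(nn.Module):
    def __init__(self, state_dim, n_components, actions_per_component, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One small head per component: actions factorize across components,
        # so the output layer grows linearly (not exponentially) with system size.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, actions_per_component) for _ in range(n_components)]
        )
        # A single centralized value function for the entire system.
        self.value = nn.Linear(hidden, 1)

    def forward(self, state):
        h = self.trunk(state)
        # One categorical distribution per component, conditioned on the full state.
        logits = [head(h) for head in self.heads]
        return logits, self.value(h)

model = FactorizedActorCritic(state_dim=10, n_components=3, actions_per_component=4)
logits, v = model(torch.randn(1, 10))
actions = [torch.distributions.Categorical(logits=l).sample() for l in logits]
print(actions, v.item())
```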
Being Bayesian about Network Structure
In many domains, we are interested in analyzing the structure of the
underlying distribution, e.g., whether one variable is a direct parent of the
other. Bayesian model-selection attempts to find the MAP model and use its
structure to answer these questions. However, when the amount of available data
is modest, there might be many models that have non-negligible posterior. Thus,
we want to compute the Bayesian posterior of a feature, i.e., the total posterior
probability of all models that contain it. In this paper, we propose a new
approach for this task. We first show how to efficiently compute a sum over the
exponential number of networks that are consistent with a fixed ordering over
network variables. This allows us to compute, for a given ordering, both the
marginal probability of the data and the posterior of a feature. We then use
this result as the basis for an algorithm that approximates the Bayesian
posterior of a feature. Our approach uses a Markov Chain Monte Carlo (MCMC)
method, but over orderings rather than over network structures. The space of
orderings is much smaller and more regular than the space of structures, and
has a smoother posterior 'landscape'. We present empirical results on synthetic
and real-life datasets that compare our approach to full model averaging (when
possible), to MCMC over network structures, and to a non-Bayesian bootstrap
approach.
Comment: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI 2000).
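The central computation can be sketched as follows: for a fixed ordering, the sum over all consistent networks factorizes into a per-variable sum over candidate parent sets drawn from that variable's predecessors. The toy local score below is a stand-in (an actual implementation would use, e.g., a BDe or BIC local score); everything else is illustrative.

```python
# Sketch: for a FIXED ordering, the sum over all consistent networks factorizes
# into a product over variables of sums over bounded-size parent sets.
from itertools import combinations
import math

def local_score(child, parents, data):
    # Toy deterministic placeholder score; an assumption, not the paper's score.
    return -len(parents) * 0.5 - abs(hash((child, parents))) % 7 / 10.0

def log_marginal_given_ordering(order, data, max_parents=2):
    total = 0.0
    for i, child in enumerate(order):
        preds = order[:i]  # parents must precede the child in the ordering
        scores = [local_score(child, ps, data)
                  for k in range(min(len(preds), max_parents) + 1)
                  for ps in combinations(preds, k)]
        # log-sum-exp over all candidate parent sets for this child
        m = max(scores)
        total += m + math.log(sum(math.exp(s - m) for s in scores))
    return total

print(log_marginal_given_ordering(("A", "B", "C"), data=None))
```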
Reinforcement Learning
Reinforcement learning (RL) is a general framework for adaptive control,
which has proven to be efficient in many domains, e.g., board games, video
games or autonomous vehicles. In such problems, an agent faces a sequential
decision-making problem where, at every time step, it observes its state,
performs an action, receives a reward and moves to a new state. An RL agent
learns by trial and error a good policy (or controller) based on observations
and numeric reward feedback on the previously performed action. In this
chapter, we present the basic framework of RL and recall the two main families
of approaches that have been developed to learn a good policy. The first, the value-based family, consists in estimating the value function of an optimal policy, from which a policy can be recovered, while the other, called policy search, works directly in a policy space. Actor-critic methods can be seen as a
policy search technique where the policy value that is learned guides the
policy improvement. Besides, we give an overview of some extensions of the
standard RL framework, notably when risk-averse behavior needs to be taken into
account or when rewards are not available or not known.
Comment: Chapter in "A Guided Tour of Artificial Intelligence Research", Springer.
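As an illustration of the value-based family mentioned above, here is a minimal tabular Q-learning agent on a toy chain environment; the environment and hyperparameters are invented for the example, not taken from the chapter.

```python
# Minimal value-based RL example: tabular Q-learning on a toy 5-state chain.
import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(s, a):
    # Toy chain: action 1 moves right, action 0 moves left; reward at the end.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(2000):
    s = 0
    for _ in range(20):
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q-learning update: bootstrap from the best next-state value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([max(q) for q in Q])  # values increase toward the rewarding state
```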
Efficient exploration with Double Uncertain Value Networks
This paper studies directed exploration for reinforcement learning agents by
tracking uncertainty about the value of each available action. We identify two
sources of uncertainty that are relevant for exploration. The first originates
from limited data (parametric uncertainty), while the second originates from
the distribution of the returns (return uncertainty). We identify methods to
learn these distributions with deep neural networks, where we estimate
parametric uncertainty with Bayesian drop-out, while return uncertainty is
propagated through the Bellman equation as a Gaussian distribution. We then show that both can be jointly estimated in one network, which we call the
Double Uncertain Value Network. The policy is directly derived from the learned
distributions based on Thompson sampling. Experimental results show that both
types of uncertainty may vastly improve learning in domains with a strong
exploration challenge.
Comment: Deep Reinforcement Learning Symposium @ Conference on Neural Information Processing Systems (NIPS) 2017.
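A tabular sketch of the return-uncertainty component: each state-action value is a Gaussian whose mean and variance are propagated through the Bellman equation, and actions are chosen by Thompson sampling. The paper works with deep networks and additionally estimates parametric uncertainty via Bayesian drop-out; both are omitted here, and the environment is invented.

```python
# Gaussian return uncertainty propagated through the Bellman equation,
# with Thompson-sampling action selection. Tabular for clarity only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 2, 0.95, 0.1
mu = np.zeros((n_states, n_actions))
var = np.ones((n_states, n_actions))  # start uncertain to drive exploration

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

for _ in range(2000):
    s = 0
    for _ in range(20):
        # Thompson sampling: draw one value sample per action, act greedily on it.
        a = int(np.argmax(rng.normal(mu[s], np.sqrt(var[s]))))
        s2, r = step(s, a)
        a2 = int(np.argmax(mu[s2]))
        # Propagate mean and variance of the return through the Bellman equation.
        mu[s, a] += lr * (r + gamma * mu[s2, a2] - mu[s, a])
        var[s, a] += lr * (gamma**2 * var[s2, a2] - var[s, a])
        s = s2

print(mu.max(axis=1))
```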
A Survey on Practical Applications of Multi-Armed and Contextual Bandits
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of
attention in various applications, from recommender systems and information
retrieval to healthcare and finance, due to its stellar performance combined
with certain attractive properties, such as learning from less feedback. The
multi-armed bandit field is currently flourishing, as novel problem settings
and algorithms motivated by various practical applications are being
introduced, building on top of the classical bandit problem. This article aims
to provide a comprehensive review of top recent developments in multiple
real-life applications of the multi-armed bandit. Specifically, we introduce a
taxonomy of common MAB-based applications and summarize the state of the art for each
of those domains. Furthermore, we identify important current trends and provide
new perspectives pertaining to the future of this exciting and fast-growing
field.
Comment: under review by IJCAI 2019 Survey.
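For readers new to the setting, a minimal classical baseline: UCB1 on Bernoulli arms. The arm probabilities and horizon are made up for illustration.

```python
# Minimal multi-armed bandit example: UCB1 on Bernoulli arms.
import math, random

probs = [0.2, 0.5, 0.7]           # true arm probabilities, unknown to the learner
counts = [0] * len(probs)
values = [0.0] * len(probs)

for t in range(1, 5001):
    # Play each arm once, then pick by upper confidence bound.
    if 0 in counts:
        arm = counts.index(0)
    else:
        arm = max(range(len(probs)),
                  key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean

print(counts)  # the best arm (index 2) should dominate the pull counts
```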
A Survey and Critique of Multiagent Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved outstanding results in recent
years. This has led to a dramatic increase in the number of applications and
methods. Recent works have explored learning beyond single-agent scenarios and
have considered multiagent learning (MAL) scenarios. Initial results report
successes in complex multiagent domains, although there are several challenges
to be addressed. The primary goal of this article is to provide a clear
overview of current multiagent deep reinforcement learning (MDRL) literature.
Additionally, we complement the overview with a broader analysis: (i) we
revisit previous key components, originally presented in MAL and RL, and
highlight how they have been adapted to multiagent deep reinforcement learning
settings. (ii) We provide general guidelines to new practitioners in the area:
describing lessons learned from MDRL works, pointing to recent benchmarks, and
outlining open avenues of research. (iii) We take a more critical tone, raising
practical challenges of MDRL (e.g., implementation and computational demands).
We expect this article will help unify and motivate future research to take
advantage of the abundant literature that exists (e.g., RL and MAL) in a joint
effort to promote fruitful research in the multiagent community.
Comment: Under review since Oct 2018. Earlier versions of this work had the title: "Is multiagent deep reinforcement learning the answer or the question? A brief survey".
Learning the Structure of Dynamic Probabilistic Networks
Dynamic probabilistic networks (DPNs) are a compact representation of complex stochastic processes. In this paper we examine how to learn the structure of a
DPN from data. We extend structure scoring rules for standard probabilistic
networks to the dynamic case, and show how to search for structure when some of
the variables are hidden. Finally, we examine two applications where such a
technology might be useful: predicting and classifying dynamic behaviors, and
learning causal orderings in biological processes. We provide empirical results
that demonstrate the applicability of our methods in both domains.
Comment: Appears in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI 1998).
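To make the scoring idea concrete, a small sketch, assuming binary data and a BIC-style local score: score a candidate parent set for a variable at time t+1 given variables at time t (the 2-slice view of a DPN). The data, variables, and score are illustrative; the paper extends standard scores to the dynamic case and handles hidden variables, which this sketch does not.

```python
# BIC-style local score for a 2-slice dynamic-network edge on synthetic binary data.
import numpy as np

rng = np.random.default_rng(1)
T = 500
A = rng.integers(0, 2, T)                    # parent trajectory at time t
X = (A ^ (rng.random(T) < 0.1)).astype(int)  # child at t+1: noisy copy of A

def bic_score(child, parent_cols):
    parents = (np.column_stack(parent_cols) if parent_cols
               else np.zeros((len(child), 0), int))
    ll, n_params = 0.0, 0
    # One Bernoulli parameter per parent configuration.
    for cfg in {tuple(row) for row in parents}:
        mask = np.all(parents == cfg, axis=1)
        m = child[mask].mean()                 # empirical P(X=1 | parent config)
        p = np.clip(m, 1e-6, 1 - 1e-6)
        ll += mask.sum() * (m * np.log(p) + (1 - m) * np.log(1 - p))
        n_params += 1
    return ll - 0.5 * n_params * np.log(len(child))  # BIC complexity penalty

print("no parents: ", bic_score(X, []))
print("parent A(t):", bic_score(X, [A]))  # should score higher
```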
On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be
applied for both noise and late reverberation suppression and in [2], [1], [3]
and [4], various model-based speech enhancement algorithms that perform
modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how different algorithms perform speech enhancement; it is addressed to researchers interested in monaural speech enhancement. The algorithms are composed of different processing blocks and techniques [7]; understanding the implementation choices made during the system design is important because it provides insights that can assist the development of new algorithms.
Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter, minimum mean squared error estimation.
Comment: 13 pages.
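A heavily simplified illustration of the Kalman-filtering machinery behind modulation-domain enhancement: track a slowly varying envelope with an AR(1) state model and noisy observations. A real system operates per frequency band on short-time spectral amplitudes with estimated noise statistics; the signal and parameters below are synthetic.

```python
# Scalar Kalman filter tracking a synthetic "modulation envelope" (AR(1) model).
import numpy as np

rng = np.random.default_rng(0)
T, a, q, r = 200, 0.98, 0.01, 0.5   # steps, AR(1) coeff, process var, obs noise var

true = np.zeros(T)
for t in range(1, T):
    true[t] = a * true[t - 1] + rng.normal(0, np.sqrt(q))
obs = true + rng.normal(0, np.sqrt(r), T)   # noisy observations

x, P, est = 0.0, 1.0, []
for z in obs:
    # Predict with the AR(1) model, then correct with the noisy frame.
    x, P = a * x, a * a * P + q
    K = P / (P + r)                          # Kalman gain
    x, P = x + K * (z - x), (1 - K) * P
    est.append(x)

print("noisy MSE:   ", np.mean((obs - true) ** 2))
print("filtered MSE:", np.mean((np.array(est) - true) ** 2))
```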
Synthesizing Safe Policies under Probabilistic Constraints with Reinforcement Learning and Bayesian Model Checking
We propose to leverage epistemic uncertainty about constraint satisfaction of
a reinforcement learner in safety-critical domains. We introduce a framework
for specification of requirements for reinforcement learners in constrained
settings, including confidence about results. We show that an agent's
confidence in constraint satisfaction provides a useful signal for balancing
optimization and safety in the learning process.
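One simple way to make such confidence concrete, assuming episodes give Bernoulli evidence of constraint satisfaction: maintain a Beta posterior over the satisfaction probability and read off the mass above a required threshold. The counts, threshold, and prior are illustrative, not the paper's specification.

```python
# Beta-posterior confidence that a probabilistic constraint is satisfied.
from scipy.stats import beta

successes, violations = 47, 3        # observed episodes (made-up numbers)
threshold = 0.9                      # required satisfaction probability
posterior = beta(1 + successes, 1 + violations)  # uniform Beta(1, 1) prior

confidence = posterior.sf(threshold)  # P(p >= threshold | data)
print(f"confidence that the constraint holds w.p. >= {threshold}: {confidence:.3f}")
# Such a confidence signal can gate exploration: act conservatively while it is low.
```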
Distributional Policy Optimization: An Alternative Approach for Continuous Control
We identify a fundamental problem in policy gradient-based methods in
continuous control. As policy gradient methods require the agent's underlying
probability distribution, they limit policy representation to parametric
distribution classes. We show that optimizing over such sets results in local
movement in the action space and thus convergence to sub-optimal solutions. We
suggest a novel distributional framework, able to represent arbitrary
distribution functions over the continuous action space. Using this framework,
we construct a generative scheme, trained using an off-policy actor-critic
paradigm, which we call the Generative Actor Critic (GAC). Compared to policy
gradient methods, GAC does not require knowledge of the underlying probability
distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable to, and often surpasses, current state-of-the-art baselines in continuous domains.
Comment: Accepted to NeurIPS 2019.
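A heavily simplified sketch of the implicit-actor idea: a network maps state plus noise to an action, so the policy is not restricted to a parametric family such as a Gaussian. For brevity it is trained here by ascending a frozen toy critic, a deterministic-policy-gradient-style shortcut; the authors' GAC uses a different, distributional training scheme.

```python
# Implicit (generative) actor: state + noise -> action, trained against a toy critic.
import torch
import torch.nn as nn

# 3-dim state and 2-dim noise in, 1-dim bounded action out (sizes are arbitrary).
actor = nn.Sequential(nn.Linear(3 + 2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())

def critic(state, action):
    # Frozen toy critic: the best action is 0.8 * sign of the first state feature.
    return -(action - 0.8 * torch.sign(state[:, :1])).pow(2)

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
for _ in range(2000):
    state, noise = torch.randn(64, 3), torch.randn(64, 2)
    action = actor(torch.cat([state, noise], dim=1))
    loss = -critic(state, action).mean()  # ascend the critic (DPG-style shortcut)
    opt.zero_grad(); loss.backward(); opt.step()

s = torch.randn(4, 3)
print(actor(torch.cat([s, torch.randn(4, 2)], dim=1)).squeeze())
```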