Reinforcement Learning for the Unit Commitment Problem
In this work we solve the day-ahead unit commitment (UC) problem by
formulating it as a Markov decision process (MDP) and finding a low-cost policy
for generation scheduling. We present two reinforcement learning algorithms
and devise a third. We compare our results to previous work that uses
simulated annealing (SA), and show a 27% improvement in operation costs with a
running time of 2.5 minutes (compared to 2.5 hours for the existing
state-of-the-art).
Comment: Accepted and presented at IEEE PES PowerTech, Eindhoven 2015, paper ID 46273
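As a rough illustration of the formulation described above, the following sketch frames one hour of unit commitment as an MDP transition. All names, costs, and the merit-order dispatch rule are illustrative assumptions, not the paper's actual environment:

```python
# Minimal sketch of day-ahead UC as an MDP: the state holds the hour and
# each generator's on/off status, an action is the next commitment vector,
# and the reward is the negative operating cost of serving that hour's demand.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UCState:
    hour: int
    status: Tuple[int, ...]   # 1 = committed, 0 = off, per generator

def step(state: UCState, action: Tuple[int, ...], demand: List[float],
         marginal_cost: List[float], startup_cost: List[float],
         capacity: List[float]) -> Tuple[UCState, float]:
    """Apply a commitment decision for the next hour; return (state, reward)."""
    # Pay start-up costs for units switched on this step.
    cost = sum(s for prev, new, s in zip(state.status, action, startup_cost)
               if new == 1 and prev == 0)
    # Dispatch committed units in merit order until demand is met.
    residual = demand[state.hour]
    for i in sorted(range(len(action)), key=lambda i: marginal_cost[i]):
        if action[i] and residual > 0:
            gen = min(capacity[i], residual)
            cost += gen * marginal_cost[i]
            residual -= gen
    cost += 1e4 * max(residual, 0.0)   # penalty for unserved energy
    return UCState(state.hour + 1, tuple(action)), -cost
```

A policy learned over such transitions trades off start-up costs against dispatch costs across the scheduling horizon.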
Reinforcement Learning and Mixed-Integer Programming for Power Plant Scheduling in Low Carbon Systems: Comparison and Hybridisation
Decarbonisation is driving dramatic growth in renewable power generation.
This increases uncertainty in the load to be served by power plants and makes
their efficient scheduling, known as the unit commitment (UC) problem, more
difficult. UC is solved in practice by mixed-integer programming (MIP) methods;
however, there is growing interest in emerging data-driven methods including
reinforcement learning (RL). In this paper, we extensively test two MIP
(deterministic and stochastic) and two RL (model-free and with lookahead)
scheduling methods over a large set of test days and problem sizes, for the
first time comparing the state-of-the-art of these two approaches on a level
playing field. We find that deterministic and stochastic MIP consistently
produce lower-cost UC schedules than RL, exhibiting better reliability and
scalability with problem size. Average operating costs of RL are more than 2
times larger than stochastic MIP for a 50-generator test case, while the cost
is 13 times larger in the worst instance. However, the key strength of RL is
the ability to produce solutions practically instantly, irrespective of problem
size. We leverage this advantage to produce various initial solutions for warm
starting concurrent stochastic MIP solves. By producing several near-optimal
solutions simultaneously and then evaluating them using Monte Carlo methods,
the differences between the true cost function and the discrete approximation
required to formulate the MIP are exploited. The resulting hybrid technique
outperforms both the RL and MIP methods individually, reducing total operating
costs by 0.3% on average.
Comment: Submitted to Applied Energy, Dec 202
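The hybrid idea above can be sketched as follows. Several near-optimal candidate schedules (e.g. produced instantly by an RL policy and used to warm-start concurrent MIP solves) are re-ranked under the true cost by Monte Carlo simulation, rather than by the MIP's discrete cost approximation. Function names and the toy cost model are assumptions for illustration:

```python
# Re-rank candidate schedules by simulated true cost under demand uncertainty.
import random
from typing import Callable, List, Sequence

def monte_carlo_cost(schedule: Sequence[int],
                     true_cost: Callable[[Sequence[int], float], float],
                     demand_sampler: Callable[[], float],
                     n_scenarios: int = 200) -> float:
    """Average true operating cost of a schedule over sampled demand scenarios."""
    return sum(true_cost(schedule, demand_sampler())
               for _ in range(n_scenarios)) / n_scenarios

def pick_best(candidates: List[Sequence[int]], true_cost, demand_sampler):
    """Return the candidate schedule with the lowest simulated expected cost."""
    return min(candidates,
               key=lambda s: monte_carlo_cost(s, true_cost, demand_sampler))
```

Because every candidate is evaluated with the same true cost function, differences between that function and the MIP's discrete approximation are exploited at selection time.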
DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
Multi-step learning applies lookahead over multiple time steps and has proved
valuable in policy evaluation settings. However, in the optimal control case,
the impact of multi-step learning has been relatively limited despite a number
of prior efforts. Fundamentally, this might be because multi-step policy
improvements require operations that cannot be approximated by stochastic
samples, hence hindering the widespread adoption of such methods in practice.
To address such limitations, we introduce doubly multi-step off-policy VI
(DoMo-VI), a novel oracle algorithm that combines multi-step policy
improvements and policy evaluations. DoMo-VI enjoys guaranteed convergence
speed-up to the optimal policy and is applicable in general off-policy learning
settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC),
a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a
bias-variance trade-off that ensures improved policy gradient estimates. When
combined with the IMPALA architecture, DoMo-AC shows improvements over the
baseline algorithm on the Atari-57 game benchmarks.
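To make the multi-step off-policy ingredient concrete, here is a generic n-step bootstrap target with clipped per-step importance ratios. This is an illustrative sketch of multi-step off-policy evaluation in general, not the paper's exact DoMo-VI operator:

```python
# Off-policy n-step lookahead target with clipped importance ratios.
from typing import List

def n_step_target(rewards: List[float], values: List[float],
                  rhos: List[float], gamma: float = 0.99,
                  rho_clip: float = 1.0) -> float:
    """Bootstrap target for V(s_0).

    rewards[t], rhos[t]: reward and importance ratio pi/mu at step t;
    values[t]: current estimate V(s_t), with values[n] the bootstrap value.
    """
    n = len(rewards)
    target = values[0]
    coeff = 1.0
    for t in range(n):
        rho = min(rhos[t], rho_clip)   # clipping bounds the variance
        td = rewards[t] + gamma * values[t + 1] - values[t]
        target += coeff * rho * td
        coeff *= gamma * rho
    return target
```

Clipping the ratios introduces exactly the kind of bias-variance trade-off the abstract alludes to: smaller clips reduce variance at the cost of a biased lookahead.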
Temporal Difference Learning in Complex Domains
This thesis adapts and improves on the methods of TD(λ) (Sutton 1988) that were
successfully used for backgammon (Tesauro 1994) and applies them to other complex
games that are less amenable to simple pattern-matching approaches. The games
investigated are chess and shogi, both of which (unlike backgammon) require
significant amounts of computational effort to be expended on search in order to
achieve expert play. The improved methods are also tested in a non-game domain.
In the chess domain, the adapted TD(λ) method is shown to successfully learn the
relative values of the pieces, and matches using these learnt piece values indicate that
they perform at least as well as piece values widely quoted in elementary chess books.
The adapted TD(λ) method is also shown to work well in shogi, considered by many
researchers to be the next challenge for computer game-playing, and for which there
is no standardised set of piece values.
An original method to automatically set and adjust the major control parameters used
by TD(λ) is presented. The main performance advantage comes from the learning
rate adjustment, which is based on a new concept called temporal coherence.
Experiments in both chess and a random-walk domain show that the temporal
coherence algorithm produces both faster learning and more stable values than both
human-chosen parameters and an earlier method for learning rate adjustment.
The methods presented in this thesis allow programs to learn with as little input of
external knowledge as possible, exploring the domain on their own rather than by
being taught. Further experiments show that the method is capable of handling many
hundreds of weights, and that it is not necessary to perform deep searches during the
learning phase in order to learn effective weights.
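The temporal-coherence learning-rate adjustment described above can be sketched per weight: the rate is the ratio of the net (signed) sum to the absolute sum of recent updates, so consistent update directions keep the rate high while oscillating updates drive it toward zero. The class below is an illustrative reconstruction, not the thesis's exact code:

```python
# Per-weight temporal-coherence learning rate: |net updates| / sum |updates|.
class TemporalCoherence:
    def __init__(self, n_weights: int):
        self.net = [0.0] * n_weights   # signed sum of updates per weight
        self.tot = [0.0] * n_weights   # unsigned sum of updates per weight

    def rate(self, i: int) -> float:
        """Current learning rate for weight i (1.0 before any updates)."""
        return abs(self.net[i]) / self.tot[i] if self.tot[i] else 1.0

    def update(self, i: int, delta: float) -> float:
        """Record a raw TD update for weight i; return the step actually taken."""
        step = self.rate(i) * delta
        self.net[i] += delta
        self.tot[i] += abs(delta)
        return step
```

This self-tuning behaviour is what removes the need for a human-chosen global learning rate.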
Object-Oriented Dynamics Learning through Multi-Level Abstraction
Object-based approaches for learning action-conditioned dynamics have
demonstrated promise for generalization and interpretability. However, existing
approaches suffer from structural limitations and optimization difficulties for
common environments with multiple dynamic objects. In this paper, we present a
novel self-supervised learning framework, called Multi-level Abstraction
Object-oriented Predictor (MAOP), which employs a three-level learning
architecture that enables efficient object-based dynamics learning from raw
visual observations. We also design a spatial-temporal relational reasoning
mechanism for MAOP to support instance-level dynamics learning and handle
partial observability. Our results show that MAOP significantly outperforms
previous methods in terms of sample efficiency and generalization over novel
environments for learning environment models. We also demonstrate that learned
dynamics models enable efficient planning in unseen environments, comparable to
true environment models. In addition, MAOP learns semantically and visually
interpretable disentangled representations.
Comment: Accepted to the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 202
Temporal Difference Learning in Complex Domains
Submitted to the University of London for the Degree of Doctor of Philosophy in Computer Science
Learning Generalized Reactive Policies using Deep Neural Networks
We present a new approach to learning for planning, where knowledge acquired
while solving a given set of planning problems is used to plan faster in
related, but new problem instances. We show that a deep neural network can be
used to learn and represent a \emph{generalized reactive policy} (GRP) that
maps a problem instance and a state to an action, and that the learned GRPs
efficiently solve large classes of challenging problem instances. In contrast
to prior efforts in this direction, our approach significantly reduces the
dependence of learning on handcrafted domain knowledge or feature selection.
Instead, the GRP is trained from scratch using a set of successful execution
traces. We show that our approach can also be used to automatically learn a
heuristic function that can be used in directed search algorithms. We evaluate
our approach using an extensive suite of experiments on two challenging
planning problem domains and show that our approach facilitates learning
complex decision making policies and powerful heuristic functions with minimal
human input. Videos of our results are available at goo.gl/Hpy4e3.
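The training signal described above can be sketched as a simple imitation-learning dataset: a GRP is fit on (instance, state, action) triples harvested from successful execution traces, with no handcrafted features. The data layout below is a hypothetical illustration:

```python
# Flatten successful execution traces into supervised examples for a GRP.
from typing import List, Tuple

Trace = List[Tuple[str, str]]   # sequence of (state, action) pairs

def traces_to_dataset(instance_ids: List[str],
                      traces: List[Trace]) -> List[Tuple[str, str, str]]:
    """Build (instance, state, action) examples from successful traces."""
    return [(inst, s, a)
            for inst, trace in zip(instance_ids, traces)
            for s, a in trace]
```

A network trained on such triples maps (instance, state) to an action, which is exactly the reactive-policy interface the abstract describes.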
Secure and cost-effective operation of low carbon power systems under multiple uncertainties
Power system decarbonisation is driving the rapid deployment of renewable energy sources (RES) like wind and solar at the transmission and distribution level. Their differences from the synchronous thermal plants they are displacing make secure and efficient grid operation challenging. Frequency stability is of particular concern due to the current lack of provision of frequency ancillary services like inertia or response from RES generators. Furthermore, the weather dependency of RES generation coupled with the proliferation of distributed energy resources (DER) like small-scale solar or electric vehicles permeates future low-carbon systems with uncertainty under which legacy scheduling methods are inadequate. Overly cautious approaches to this uncertainty can lead to inefficient and expensive systems, whilst naive
methods jeopardise system security.
This thesis significantly advances the frequency-constrained scheduling literature by developing frameworks that explicitly account for multiple new uncertainties. This is in addition to RES forecast uncertainty which is the exclusive focus of most previous works. The frameworks take the form of convex constraints that are useful in many market and scheduling problems.
The constraints equip system operators with tools to explicitly guarantee their preferred level of system security whilst unlocking substantial value from emerging and abundant DERs. A major contribution is to address the exclusion of DERs from the provision of ancillary services due to their intrinsic uncertainty from aggregation. This is done by incorporating the uncertainty into the system frequency dynamics, from which deterministic convex constraints are derived. In addition to managing uncertainty to facilitate emerging DERs to provide legacy frequency services, a novel frequency containment service is designed. The framework allows a small amount of load shedding to assist with frequency containment during high RES low inertia periods. The expected cost of this service is probabilistic as it is proportional to the probability of a contingency occurring. The framework optimally balances the potentially higher expected costs of an outage against the operational cost benefits of lower ancillary service requirements day-to-day.
The developed frameworks are applied extensively to several case studies. These validate their security and demonstrate their significant economic and emission-saving benefits.
Learning Visual Locomotion with Cross-Modal Supervision
In this work, we show how to learn a visual walking policy that only uses a
monocular RGB camera and proprioception. Since simulating RGB is hard, we
necessarily have to learn vision in the real world. We start with a blind
walking policy trained in simulation. This policy can traverse some terrains in
the real world but often struggles since it lacks knowledge of the upcoming
geometry. This can be resolved with the use of vision. We train a visual module
in the real world to predict the upcoming terrain with our proposed algorithm
Cross-Modal Supervision (CMS). CMS uses time-shifted proprioception to
supervise vision and allows the policy to continually improve with more
real-world experience. We evaluate our vision-based walking policy over a
diverse set of terrains including stairs (up to 19cm high), slippery slopes
(inclination of 35 degrees), curbs and tall steps (up to 20cm), and complex
discrete terrains. We achieve this performance with less than 30 minutes of
real-world data. Finally, we show that our policy can adapt to shifts in the
visual field with a limited amount of real-world experience. Video results and
code at https://antonilo.github.io/vision_locomotion/.
Comment: Learning to walk from pixels in the real world by using proprioception as supervision. Project page for videos and code: https://antonilo.github.io/vision_locomotion
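The core of cross-modal supervision, as described above, is a time shift: the image captured at time t is labelled with the terrain that proprioception measures k steps later, once the robot's feet actually reach it. The data pairing below is an illustrative assumption, not the paper's code:

```python
# Pair each image with the terrain label proprioception provides `shift` steps later.
from typing import List, Tuple

def make_cms_dataset(images: List[str], proprio_terrain: List[float],
                     shift: int) -> List[Tuple[str, float]]:
    """Build (image, time-shifted proprioceptive label) training pairs."""
    return [(images[t], proprio_terrain[t + shift])
            for t in range(len(images) - shift)]
```

Because the labels come from the robot's own sensors, the vision module can keep improving from ordinary real-world walking, with no human annotation.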