Routing in multi-class queueing networks
PhD Thesis
We consider the problem of routing (incorporating local scheduling) in a distributed
network. Dedicated jobs arrive directly at their specified station for processing. The
choice of station for generic jobs is open. Each job class has an associated holding cost
rate. We aim to develop routing policies to minimise the long-run average holding cost
rate.
We first consider the class of static policies. Dacre, Glazebrook and Niño-Mora (1999)
developed an approach to the formulation of static routing policies, in which the work at
each station is scheduled optimally, using the achievable region approach. The achievable
region approach attempts to solve stochastic optimisation problems by characterising
the space of all possible performances and optimising the performance objective over
this space. Optimal local scheduling takes the form of a priority policy. Such static
routing policies distribute the generic traffic to the stations via a simple Bernoulli routing
mechanism. We provide an overview of the achievements made in following this approach
to static routing. In the course of this discussion we expand upon the study of Becker et al.
(2000) in which they considered routing to a collection of stations specialised in processing
certain job classes and we consider how the composition of the available stations affects
the system performance for this particular problem. We conclude our examination of
static routing policies with an investigation into a network design problem in which the
number of stations available for processing remains to be determined.
The second class of policies of interest is the class of dynamic policies. General DP
theory asserts the existence of a deterministic, stationary and Markov optimal dynamic
policy. However, a full DP solution may be unobtainable and theoretical difficulties posed
by simple routing problems suggest that a closed form optimal policy may not be available.
This motivates a requirement for good heuristic policies. We consider two approaches to
the development of dynamic routing heuristics. We develop an idea proposed, in the
context of simple single class systems, by Krishnan (1987) by applying a single policy
improvement step to some given static policy. The resulting dynamic policy is shown
to be of simple structure and easily computable. We include an investigation into the
comparative performance of the dynamic policy with a number of competitor policies and
of the performance of the heuristic as the number of stations in the network changes. In
our second approach the generic traffic may only access processing when the station has
been cleared of all (higher priority) jobs and can be considered as background work. We
deploy a prescription of Whittle (1988), developed for restless bandit problems (RBPs), to derive a suitable approach
to station indexation. Taking an approximate approach to Whittle's proposal results
in a very simple form of index policy for routing the generic traffic. We investigate the
closeness to optimality of the index policy and compare the performance of both of the
dynamic routing policies developed here.
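The gap between a static Bernoulli split and a state-dependent index heuristic can be sketched in a toy simulation. Everything below (two symmetric stations, slotted time, and the greedy index c_i(n_i + 1)/mu_i for the expected extra holding cost) is an illustrative assumption for this example, not the thesis's exact model:

```python
import random

MU = [0.6, 0.6]        # per-slot service completion probability at each station
COST = [1.0, 1.0]      # holding cost rate per job per slot
ARRIVAL_P = 0.9        # probability a generic job arrives in a slot

def simulate(route, steps=100_000, seed=0):
    """Slotted-time simulation; returns the long-run average holding cost."""
    rng = random.Random(seed)
    queues = [0, 0]
    total_cost = 0.0
    for _ in range(steps):
        if rng.random() < ARRIVAL_P:
            queues[route(queues, rng)] += 1
        for i in range(2):
            if queues[i] > 0 and rng.random() < MU[i]:
                queues[i] -= 1
        total_cost += sum(c * n for c, n in zip(COST, queues))
    return total_cost / steps

def bernoulli_route(queues, rng):
    # Static policy: split the generic traffic 50/50, ignoring queue state.
    return 0 if rng.random() < 0.5 else 1

def index_route(queues, rng):
    # Greedy dynamic heuristic: send the job where the (assumed) expected
    # extra holding cost c_i * (n_i + 1) / mu_i is smallest.
    return min(range(2), key=lambda i: COST[i] * (queues[i] + 1) / MU[i])

static_cost = simulate(bernoulli_route)
dynamic_cost = simulate(index_route)
```

As expected, the state-dependent rule incurs a lower average holding cost than the state-blind Bernoulli split on this toy instance.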
A hybridisation technique for game playing using the upper confidence for trees algorithm with artificial neural networks
In the domain of strategic game playing, the use of statistical techniques such as the Upper Confidence for Trees (UCT) algorithm has become the norm, as they offer many benefits over classical algorithms. These benefits include requiring no game-specific strategic knowledge and time-scalable performance. UCT does not incorporate any strategic information specific to the game considered, but instead uses repeated sampling to brute-force its way through the game tree or search space. The lack of game-specific knowledge in UCT is thus both a benefit and a strategic disadvantage. Pattern recognition techniques, specifically Neural Networks (NNs), were identified as a means of addressing the lack of game-specific knowledge in UCT. Through a novel hybridisation technique which combines UCT and trained NNs for pruning, the UCT-NN algorithm was derived. The NN component of UCT-NN was trained using a UCT self-play scheme to generate game-specific knowledge without the need to construct and manage game databases for training purposes. The UCT-NN algorithm is outlined for pruning in the game of Go-Moku as a candidate case study for this research. The UCT-NN algorithm contains three major parameters, which emerge from the UCT algorithm, the use of NNs, and the pruning schemes considered. Suitable methods for finding candidate values for these three parameters were outlined and applied to the game of Go-Moku on a 5 by 5 board. An empirical investigation of the playing performance of UCT-NN in comparison to UCT was conducted through three benchmarks. The benchmarks comprise a common randomly moving opponent, a common UCTmax player which is given a large amount of playing time, and a pair-wise tournament between UCT-NN and UCT. The results of the performance evaluation for 5 by 5 Go-Moku were promising, which prompted an evaluation on a larger 9 by 9 Go-Moku board.
The results of both evaluations indicate that the time allocated to the UCT-NN algorithm directly affects its performance relative to UCT. The UCT-NN algorithm generally performs better than UCT in games with very limited time constraints in all benchmarks considered, except when playing against a randomly moving player in 9 by 9 Go-Moku. In real-time and near-real-time Go-Moku games, UCT-NN provides statistically significant improvements over UCT. The findings of this research contribute to the realisation of applying game-specific knowledge to the UCT algorithm.
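The repeated-sampling core of UCT can be illustrated with the UCB1 selection rule it applies at each tree node. The sketch below runs UCB1 in a flat, single-node setting with made-up move win rates (an assumption for this example); in UCT the same rule chooses among child moves during tree descent:

```python
import math
import random

def ucb1_select(counts, values, c=1.4):
    """Pick the arm maximising mean value plus an exploration bonus."""
    total = sum(counts)
    best, best_score = 0, float("-inf")
    for i, (n, v) in enumerate(zip(counts, values)):
        if n == 0:
            return i                      # try every arm once first
        score = v / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best

rng = random.Random(1)
win_prob = [0.3, 0.5, 0.7]                # hypothetical move win rates
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]
for _ in range(5000):
    arm = ucb1_select(counts, values)
    reward = 1.0 if rng.random() < win_prob[arm] else 0.0
    counts[arm] += 1
    values[arm] += reward
```

Because the bonus shrinks as an arm is sampled, the loop concentrates its visits on the highest-value move while still occasionally re-checking the others.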
New Formulations and Solution Methods for the Dial-a-ride Problem
The classic Dial-A-Ride Problem (DARP) aims at designing the minimum-cost routing solution that accommodates a set of user requests under constraints at the operations planning level. It is a highly constrained combinatorial optimization problem initially designed for providing door-to-door transportation for people with limited mobility (e.g. the elderly or disabled). It consists of routing and scheduling a fleet of capacitated vehicles to service a set of requests with specified pickup and drop-off locations and time windows. With the details of requests obtained either beforehand (static DARP) or en-route (dynamic DARP), dial-a-ride operators strive to deliver efficient and yet high-quality transport services that satisfy each passenger's individual travel needs.
The goal of this thesis is threefold: (1) to propose rich DARP formulations where users' preferences are taken into account, in order to improve service quality of Demand-Responsive Transport (DRT) services and promote ridership strategically; (2) to develop novel and efficient solution methods where local search, column generation, metaheuristics and machine learning techniques are integrated to solve large-scale DARPs; and (3) to conduct real-life DARP case studies (using data extracted from NYC Yellow Taxi trip records) to test the practicality of the proposed models and solution methods, as well as to emphasise the importance of connecting algorithms with real-world datasets. These aims are achieved and presented in the three core chapters of this thesis. In the first core chapter (Chapter 3), two Mixed Integer Programming (MIP) formulations of DARP (link-based and path-based) are presented, alongside their objective functions and standard solution methods. This chapter builds the foundation of the thesis by detailing the base models and algorithms on which the thesis rests, and by running benchmark experiments and reporting numerical results as the baseline for the whole thesis. In the second core chapter (Chapter 4), two DARP models (one deterministic, one stochastic) that integrate users' preferences from the dial-a-ride service operators' perspective are proposed, enabling operators to optimise their overall profit while maintaining service quality. In these models, users' preferences are captured through utility functions within the dial-a-ride problem. A customised local-search-based heuristic and a matheuristic are developed to solve the proposed Chance-Constrained DARP (CC-DARP). Numerical results are reported for both DARP benchmark instances and a realistic case study based on New York City yellow taxi trip data. This chapter also explores the design of revenue/fleet management and pricing differentiation.
The proposed chance-constrained DARP formulation provides a new decision-support tool to inform revenue and fleet management, including fleet sizing, for DRT systems at a strategic planning level. In the last core chapter (Chapter 5), three hybrid metaheuristic algorithms integrated with Reinforcement Learning (RL) techniques are proposed and implemented, aiming to increase the scale-up capability of existing DARP solution methods. Machine learning techniques and/or a branching scheme are incorporated into various metaheuristic algorithms, including VNS and LNS, providing innovative methodologies to solve large-instance DARPs more efficiently. Thompson Sampling (TS) is applied to model the dual values of requests in a column generation setting, negating the effect of dual oscillation (i.e. promoting faster convergence). The performance of the proposed algorithms is tested on benchmark datasets, and strengths and weaknesses across the different algorithms are reported.
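The Thompson Sampling idea mentioned above, keeping a posterior over each request's noisy, oscillating dual value and acting on a posterior sample rather than the latest observation, can be sketched with a generic Gaussian model. The priors, observation noise, and "true" dual values below are invented for illustration and are not taken from the thesis:

```python
import random

PRIOR_VAR, OBS_VAR = 100.0, 1.0

class GaussianTS:
    """Gaussian posterior over each request's dual value."""
    def __init__(self, n_requests):
        self.mean = [0.0] * n_requests
        self.var = [PRIOR_VAR] * n_requests

    def sample(self, rng):
        # One posterior draw per request; acting on draws smooths oscillation.
        return [rng.gauss(m, v ** 0.5) for m, v in zip(self.mean, self.var)]

    def update(self, i, observed):
        # Conjugate Gaussian update with known observation noise.
        old_prec = 1.0 / self.var[i]
        new_prec = old_prec + 1.0 / OBS_VAR
        self.mean[i] = (self.mean[i] * old_prec + observed / OBS_VAR) / new_prec
        self.var[i] = 1.0 / new_prec

true_duals = [2.0, 5.0, 9.0]       # hypothetical "true" dual values
rng = random.Random(0)
ts = GaussianTS(len(true_duals))
picks = [0] * len(true_duals)
for _ in range(300):
    sampled = ts.sample(rng)
    i = max(range(len(true_duals)), key=lambda j: sampled[j])
    picks[i] += 1
    ts.update(i, rng.gauss(true_duals[i], OBS_VAR ** 0.5))
```

After a short burn-in the sampler concentrates on the request with the largest dual value, while its posterior variances record how settled each estimate is.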
Information-theoretic Reasoning in Distributed and Autonomous Systems
The increasing prevalence of distributed and autonomous systems is transforming decision making in industries as diverse as agriculture, environmental monitoring, and healthcare. Despite significant efforts, challenges remain in robustly planning under uncertainty. In this thesis, we present a number of information-theoretic decision rules for improving the analysis and control of complex adaptive systems. We begin with the problem of quantifying the data storage (memory) and transfer (communication) within information processing systems. We develop an information-theoretic framework to study nonlinear interactions within cooperative and adversarial scenarios, solely from observations of each agent's dynamics. This framework is applied to simulations of robotic soccer games, where the measures reveal insights into team performance, including correlations of the information dynamics with the scoreline. We then study the communication between processes with latent nonlinear dynamics that are observed only through a filter. Using methods from differential topology, we show that the information-theoretic measures commonly used to infer communication in observed systems can also be used in certain partially observed systems. For robotic environmental monitoring, the quality of data depends on the placement of sensors. These locations can be improved either by better estimating the quality of future viewpoints or by deploying a team of robots operating concurrently. By robustly handling the uncertainty of sensor model measurements, we present the first end-to-end robotic system for autonomously tracking small dynamic animals, with performance comparable to human trackers. We then address the coordination of multi-robot systems through distributed optimisation techniques. These allow us to develop non-myopic robot trajectories for these tasks and, importantly, to show that these algorithms provide guarantees on convergence rates to the optimal payoff sequence.
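A standard way to quantify directed information transfer between two observed processes is the discrete transfer entropy TE(Y→X) = Σ p(x', x, y) log2 [ p(x'|x, y) / p(x'|x) ]. The plug-in estimator below is a generic sketch of this family of measures, not necessarily the exact estimator used in the thesis; the two test series are synthetic:

```python
import math
import random
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of TE(Y -> X) for discrete series, in bits."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))
    pairs_xx = Counter(zip(x[1:], x[:-1]))
    singles_x = Counter(x[:-1])
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_cond_xy = c / pairs_xy[(x0, y0)]          # p(x'|x, y)
        p_cond_x = pairs_xx[(x1, x0)] / singles_x[x0]  # p(x'|x)
        te += p_joint * math.log2(p_cond_xy / p_cond_x)
    return te

rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(5000)]
x_driven = [0] + y[:-1]                  # x copies y with one step of lag
x_indep = [rng.randint(0, 1) for _ in range(5000)]

te_driven = transfer_entropy(x_driven, y)
te_indep = transfer_entropy(x_indep, y)
```

When x simply lags y the estimate approaches one bit per step, and for independent series it stays near zero (up to finite-sample bias), which is the qualitative signature such measures exploit when inferring communication between agents.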
Bayesian Learning for Data-Efficient Control
Applications that learn to control unfamiliar dynamical systems with increasing autonomy are ubiquitous. From robotics, to finance, to industrial processing, autonomous learning helps obviate a heavy reliance on experts for system identification and controller design. Often real-world systems are nonlinear, stochastic, and expensive to operate (e.g. slow, energy intensive, prone to wear and tear). Ideally, therefore, nonlinear systems can be identified with minimal system interaction. This thesis considers data-efficient autonomous learning of control of nonlinear, stochastic systems. Data-efficient learning critically requires probabilistic modelling of dynamics. Traditional control approaches use deterministic models, which easily overfit data, especially small datasets. We use probabilistic Bayesian modelling to learn systems from scratch, similar to the PILCO algorithm, which achieved unprecedented data efficiency in learning control of several benchmarks. We extend PILCO in three principal ways. First, we learn control under significant observation noise by simulating a filtered control process using a tractably analytic framework of Gaussian distributions. In addition, we develop the ‘latent variable belief Markov decision process’ for when filters must predict under real-time constraints. Second, we improve PILCO’s data efficiency by directing exploration with predictive loss uncertainty and Bayesian optimisation, including a novel approximation to the Gittins index. Third, we take a step towards data-efficient learning of high-dimensional control using Bayesian neural networks (BNNs). Experimentally we show that, although filtering mitigates adverse effects of observation noise, much greater performance is achieved when optimising controllers with evaluations faithful to reality: by simulating closed-loop filtered control when executing closed-loop filtered control. Thus, controllers are optimised w.r.t.
how they are used, outperforming filters applied to systems optimised by unfiltered simulations. We show that directed exploration improves data efficiency. Lastly, we show that BNN dynamics models are almost as data efficient as Gaussian process models. Results show that data-efficient learning of high-dimensional control is possible as BNNs scale to high-dimensional state inputs.
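The core loop of data-efficient model-based learning, fit a probabilistic dynamics model from a small batch of interaction data, then evaluate controllers by propagating the state *distribution* through it, can be sketched with a linear-Gaussian stand-in. The thesis uses Gaussian processes and BNNs; the linear model below is an assumption that keeps the moment propagation analytic and the example short, and the system and policy gains are invented:

```python
import random

rng = random.Random(0)
A_TRUE, B_TRUE, NOISE_SD = 0.9, 0.5, 0.1   # hypothetical 1-D system

# 1. Collect a small batch of interaction data.
data = []
x = 0.0
for _ in range(200):
    u = rng.uniform(-1, 1)
    x_next = A_TRUE * x + B_TRUE * u + rng.gauss(0, NOISE_SD)
    data.append((x, u, x_next))
    x = x_next

# 2. Fit the model x' = a x + b u by least squares (2x2 normal equations).
sxx = sum(x * x for x, u, _ in data); sxu = sum(x * u for x, u, _ in data)
suu = sum(u * u for x, u, _ in data)
sxy = sum(x * y for x, u, y in data); suy = sum(u * y for x, u, y in data)
det = sxx * suu - sxu * sxu
a_hat = (suu * sxy - sxu * suy) / det
b_hat = (sxx * suy - sxu * sxy) / det

# 3. Evaluate a linear controller u = -k x by propagating mean and variance
#    through the learned model, rather than a single deterministic rollout.
def expected_cost(k, horizon=50):
    mean, var = 1.0, 0.0                   # start at x0 = 1
    cost = 0.0
    for _ in range(horizon):
        closed = a_hat - b_hat * k         # closed-loop gain
        mean, var = closed * mean, closed ** 2 * var + NOISE_SD ** 2
        cost += mean ** 2 + var            # E[x^2] = mean^2 + var
    return cost

costs = {k: expected_cost(k) for k in (0.0, 0.9, 1.8)}
```

Because the evaluation accumulates E[x²] = mean² + var, model and process uncertainty count against a controller, which is the property that lets such schemes rank controllers without further real-system trials.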
Learning from interaction: models and applications
A large proportion of Machine Learning (ML) research focuses on designing algorithms that require minimal input from the human. However, ML algorithms are now widely used in various areas of engineering to design and build systems that interact with the human user and thus need to “learn” from this interaction. In this work, we concentrate on algorithms that learn from user interaction. A significant part of the dissertation is devoted to learning in the bandit setting. We propose a general framework for handling dependencies across arms, based on the new assumption that the mean-reward function is drawn from a Gaussian Process. Additionally, we propose an alternative method for arm selection using Thompson sampling, and we apply the new algorithms to a grammar learning problem. In the remainder of the dissertation, we consider content-based image retrieval in the case when the user is unable to specify the required content through tags or other image properties, so the system must extract information from the user through limited feedback. We present a novel Bayesian approach that uses latent random variables to model the system's imperfect knowledge about the user's expected response to the images. An important aspect of the algorithm is the incorporation of an explicit exploration-exploitation strategy in the image sampling process. A second aspect of our algorithm is the way in which its knowledge of the target image is updated given user feedback. We consider several algorithms for doing so: variational Bayes, Gibbs sampling and a simple uniform update. We show in experiments that the simple uniform update performs best. The reason is that, unlike the uniform update, both variational Bayes and Gibbs sampling tend to focus aggressively on a small set of images.
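The dependent-arms idea, modelling the mean-reward function over arms as a draw from a Gaussian Process so that pulling one arm is informative about its neighbours, can be sketched with Thompson sampling from a GP posterior. The kernel, noise level, and the true reward curve below are assumptions made up for this example:

```python
import math
import random

ARMS = [i / 9 for i in range(10)]         # arm "locations" in [0, 1]
NOISE = 0.1

def kernel(a, b, ell=0.2):
    # Squared-exponential kernel: nearby arms have correlated rewards.
    return math.exp(-0.5 * ((a - b) / ell) ** 2)

def cholesky(m):
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(m[i][i] - s, 1e-12))
            else:
                L[i][j] = (m[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    # Forward substitution: solve L z = b.
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def gp_posterior(xs, ys):
    """Posterior mean and covariance of the reward at every arm."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (NOISE ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    z = solve_lower(L, ys)
    V = [solve_lower(L, [kernel(x, a) for x in xs]) for a in ARMS]
    mean = [sum(v[i] * z[i] for i in range(n)) for v in V]
    cov = [[kernel(a, b) - sum(V[ia][i] * V[ib][i] for i in range(n))
            for ib, b in enumerate(ARMS)] for ia, a in enumerate(ARMS)]
    return mean, cov

def true_reward(a):
    return math.sin(3 * a)                # hypothetical reward curve

rng = random.Random(0)
xs, ys = [], []
pulls = [0] * len(ARMS)
for _ in range(40):
    if not xs:
        arm = rng.randrange(len(ARMS))
    else:
        mean, cov = gp_posterior(xs, ys)
        # Thompson sampling: draw one joint sample of the reward function.
        Ls = cholesky([[cov[i][j] + (1e-6 if i == j else 0.0)
                        for j in range(len(ARMS))] for i in range(len(ARMS))])
        eps = [rng.gauss(0, 1) for _ in ARMS]
        f = [mean[i] + sum(Ls[i][k] * eps[k] for k in range(i + 1))
             for i in range(len(ARMS))]
        arm = max(range(len(ARMS)), key=lambda i: f[i])
    pulls[arm] += 1
    xs.append(ARMS[arm])
    ys.append(true_reward(ARMS[arm]) + rng.gauss(0, NOISE))

mean, _ = gp_posterior(xs, ys)
```

Because the kernel couples the arms, a handful of pulls is enough for the posterior mean to rank unvisited arms, which is the payoff of the dependent-arms assumption over treating arms independently.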
Monte Carlo Tree Search for games with Hidden Information and Uncertainty
Monte Carlo Tree Search (MCTS) is an AI technique
that has been successfully applied to many deterministic games
of perfect information, leading to large advances in a number of domains,
such as Go and General Game Playing.
Imperfect information games are less well studied in the field of AI
despite being popular and of significant commercial interest,
for example in the case of computer and mobile adaptations of turn based board and card games.
This is largely because hidden information and uncertainty
lead to a large increase in complexity compared to perfect information games.
In this thesis MCTS is extended to games with hidden information and uncertainty
through the introduction of the Information Set MCTS (ISMCTS) family of algorithms.
It is demonstrated that ISMCTS can handle hidden information and uncertainty
in a variety of complex board and card games.
This is achieved whilst preserving the general applicability of MCTS
and using computational budgets appropriate for use in a commercial game.
The ISMCTS algorithm is shown to outperform the existing approach of Perfect Information Monte Carlo (PIMC) search.
Additionally it is shown that ISMCTS can be used to solve two known issues with PIMC search,
namely strategy fusion and non-locality.
ISMCTS has been integrated into a commercial game, Spades by AI Factory,
with over 2.5 million downloads.
The Information Capture And ReUSe (ICARUS) framework is also introduced in this thesis.
The ICARUS framework generalises MCTS enhancements in terms of information capture (from MCTS simulations)
and reuse (to improve MCTS tree and simulation policies).
The ICARUS framework is used to express existing enhancements,
to provide a tool to design new ones,
and to rigorously define how MCTS enhancements can be combined.
The ICARUS framework is tested across a wide variety of games.
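The determinization step at the core of ISMCTS can be sketched on a toy guessing game invented for this example: the opponent holds one hidden card, our belief over it is non-uniform, and a correct guess scores 1. Each iteration samples a determinization (a concrete hidden card consistent with the belief) and then applies UCB selection over the actions of the single information-set node; the game and belief are assumptions, not from the thesis:

```python
import math
import random

CARDS = [0, 1, 2]
BELIEF = [0.6, 0.3, 0.1]      # assumed belief over the opponent's hidden card

def sample_determinization(rng):
    """Draw a concrete hidden card from the belief."""
    r, acc = rng.random(), 0.0
    for card, p in zip(CARDS, BELIEF):
        acc += p
        if r < acc:
            return card
    return CARDS[-1]

def ismcts(iterations=3000, seed=0):
    rng = random.Random(seed)
    visits = [0] * len(CARDS)
    wins = [0.0] * len(CARDS)
    for t in range(1, iterations + 1):
        hidden = sample_determinization(rng)      # determinize hidden state
        # UCB1 over the information-set node's actions.
        def score(a):
            if visits[a] == 0:
                return float("inf")
            return wins[a] / visits[a] + math.sqrt(2 * math.log(t) / visits[a])
        guess = max(range(len(CARDS)), key=score)
        reward = 1.0 if guess == hidden else 0.0  # simulate and backpropagate
        visits[guess] += 1
        wins[guess] += reward
    return max(range(len(CARDS)), key=lambda a: visits[a]), visits

best_action, visit_counts = ismcts()
```

Because statistics are pooled at the information-set node across many determinizations, the recommended action reflects the belief over hidden states rather than any single guessed world, which is how ISMCTS avoids the strategy-fusion trap of solving each determinization separately.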
Dynamic non-price strategy and competition: Models of R&D, advertising and location.
The dependence of present opportunities, costs, and benefits on past choices is pervasive in industrial markets. Each of the three chapters of this thesis considers a different example of such dependence affecting dynamic behaviour. In the first chapter a single firm's present choices depend on what it has learnt from past experience. The firm is searching for the best outcome of many multi-stage projects and learns as stages are completed. The branching structure of the search environment is such that the payoffs to various actions are correlated; nevertheless, it is shown that the optimal strategy is given by a simple reservation price rule. The chapter provides a simple model of R&D as an example. In the central model of the second chapter, firms slowly build up stocks of goodwill through advertising. While many firms start to advertise in a new market, over time a successful set emerges and the others exit. The chapter explores the relative growth of firms and the determination of the number of successful ones. The chapter compares the results to those of a model in which a firm must complete all of a given number of R&D stages before being able to produce. The final chapter considers one of the effects of urban bus deregulation in the UK: bus arrival times are changed very frequently. It is assumed that passengers do not know the timetable and, once at a stop, board the first bus to arrive. There can be no equilibrium in which an operator's bus arrival times are never revised: otherwise those of a rival would arrive just before and take all the waiting passengers. The chapter considers the pattern of revisions when they are costly. The chapter also shows that fares can be higher with two competing operators than with a single monopolist.
Learning domain abstractions for long lived robots
Recent trends in robotics have seen more general purpose robots being deployed in
unstructured environments for prolonged periods of time. Such robots are expected to
adapt to different environmental conditions, and ultimately take on a broader range of
responsibilities, the specifications of which may change online after the robot has been
deployed.
We propose that in order for a robot to be generally capable in an online sense
when it encounters a range of unknown tasks, it must have the ability to continually
learn from a lifetime of experience. Key to this is the ability to generalise from experiences
and form representations which facilitate faster learning of new tasks, as well as
the transfer of knowledge between different situations. However, experience cannot be
managed naïvely: one does not want constantly expanding tables of data, but instead
continually refined abstractions of the data – much like humans seem to abstract and
organise knowledge. If this agent is active in the same, or similar, classes of environments
for a prolonged period of time, it is provided with the opportunity to build
abstract representations in order to simplify the learning of future tasks. The domain
is a common structure underlying large families of tasks, and exploiting this affords
the agent the potential to not only minimise relearning from scratch, but over time to
build better models of the environment. We propose to learn such regularities from the
environment, and extract the commonalities between tasks.
This thesis aims to address the major question: what are the domain invariances
which should be learnt by a long lived agent which encounters a range of different
tasks? This question can be decomposed into three dimensions for learning invariances,
based on perception, action and interaction. We present novel algorithms for
dealing with each of these three factors.
Firstly, how does the agent learn to represent the structure of the world? We focus
here on learning inter-object relationships from depth information as a concise
representation of the structure of the domain. To this end we introduce contact point
networks as a topological abstraction of a scene, and present an algorithm based on
support vector machine decision boundaries for extracting these from three dimensional
point clouds obtained from the agent’s experience of a domain. By reducing the
specific geometry of an environment into general skeletons based on contact between
different objects, we can autonomously learn predicates describing spatial relationships.
Secondly, how does the agent learn to acquire general domain knowledge? While
the agent attempts new tasks, it requires a mechanism to control exploration, particularly
when it has many courses of action available to it. To this end we draw on the fact
that many local behaviours are common to different tasks. Identifying these amounts
to learning “common sense” behavioural invariances across multiple tasks. This principle
leads to our concept of action priors, which are defined as Dirichlet distributions
over the action set of the agent. These are learnt from previous behaviours, and expressed
as the prior probability of selecting each action in a state, and are used to guide
the learning of novel tasks as an exploration policy within a reinforcement learning
framework.
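The action-prior construction above can be sketched directly: per-state Dirichlet counts accumulated from the policies of earlier tasks, then sampled to bias exploration on a new task. The states, action set, and earlier-task behaviours below are toy assumptions for illustration:

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

class ActionPrior:
    def __init__(self, pseudocount=1.0):
        # One Dirichlet (as a dict of concentration parameters) per state.
        self.alpha = defaultdict(lambda: dict.fromkeys(ACTIONS, pseudocount))

    def observe(self, state, action):
        """Count an action used by an earlier task's policy in `state`."""
        self.alpha[state][action] += 1.0

    def sample_exploration_policy(self, state, rng):
        """Draw action probabilities from the Dirichlet for `state`."""
        draws = {a: rng.gammavariate(self.alpha[state][a], 1.0)
                 for a in ACTIONS}
        total = sum(draws.values())
        return {a: g / total for a, g in draws.items()}

prior = ActionPrior()
# Suppose earlier tasks' policies chose "right" three times and "up" once
# in the (hypothetical) state "doorway".
for a in ["right", "right", "right", "up"]:
    prior.observe("doorway", a)

rng = random.Random(0)
avg = dict.fromkeys(ACTIONS, 0.0)
for _ in range(2000):
    p = prior.sample_exploration_policy("doorway", rng)
    for a in ACTIONS:
        avg[a] += p[a] / 2000
```

On average the sampled exploration policy favours actions that were useful across previous tasks, while the pseudocount keeps every action possible so novel tasks can still override the prior.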
Finally, how can the agent react online with sparse information? There are times
when an agent is required to respond fast to some interactive setting, when it may have
encountered similar tasks previously. To address this problem, we introduce the notion
of types, being a latent class variable describing related problem instances. The agent
is required to learn, identify and respond to these different types in online interactive
scenarios. We then introduce Bayesian policy reuse as an algorithm that involves maintaining
beliefs over the current task instance, updating these from sparse signals, and
selecting and instantiating an optimal response from a behaviour library.
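The Bayesian policy reuse loop described above, maintain a belief over latent task types, update it from a sparse performance signal, and respond with the best library policy under the current belief, can be sketched as follows. The types, payoff table, and Gaussian signal model are toy assumptions for this example:

```python
import math
import random

TYPES = ["A", "B"]
POLICIES = ["pi_A", "pi_B"]
# Assumed expected payoff of each library policy against each type,
# as if estimated from offline experience.
PAYOFF = {("pi_A", "A"): 1.0, ("pi_A", "B"): 0.2,
          ("pi_B", "A"): 0.3, ("pi_B", "B"): 1.0}
SIGNAL_SD = 0.3   # observed payoff = expected payoff + Gaussian noise

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def update_belief(belief, policy, observed):
    """Bayes rule over types, using the payoff signal as the observation."""
    post = {t: belief[t] * normal_pdf(observed, PAYOFF[(policy, t)], SIGNAL_SD)
            for t in TYPES}
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

def select_policy(belief):
    """Instantiate the library policy with the best expected payoff."""
    return max(POLICIES, key=lambda pi: sum(belief[t] * PAYOFF[(pi, t)]
                                            for t in TYPES))

rng = random.Random(0)
belief = {"A": 0.5, "B": 0.5}
true_type = "A"
for _ in range(10):
    pi = select_policy(belief)
    signal = rng.gauss(PAYOFF[(pi, true_type)], SIGNAL_SD)
    belief = update_belief(belief, pi, signal)
```

Even noisy payoff signals quickly concentrate the belief on the true type, after which the matching library policy is selected, which is what makes the scheme fast enough for online interactive settings.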
This thesis therefore makes the following contributions. We provide the first algorithm
for autonomously learning spatial relationships between objects from point
cloud data. We then provide an algorithm for extracting action priors from a set of
policies, and show that considerable gains in speed can be achieved in learning subsequent
tasks over learning from scratch, particularly in reducing the initial losses associated
with unguided exploration. Additionally, we demonstrate how these action priors
allow for safe exploration, feature selection, and a method for analysing and advising
other agents’ movement through a domain. Finally, we introduce Bayesian policy
reuse which allows an agent to quickly draw on a library of policies and instantiate the
correct one, enabling rapid online responses to adversarial conditions.