144 research outputs found
The design of absorbing Bayesian pursuit algorithms and the formal analyses of their ε-optimality
The fundamental phenomenon that has been used to enhance the convergence speed of learning automata (LA) is that of incorporating the running maximum likelihood (ML) estimates of the action reward probabilities into the probability updating rules for selecting the actions. The frontiers of this field have recently been expanded by replacing the ML estimates with their corresponding Bayesian counterparts that incorporate the properties of the conjugate priors. These constitute the Bayesian pursuit algorithm (BPA) and the discretized Bayesian pursuit algorithm. Although these algorithms have been designed and efficiently implemented, and are, arguably, the fastest and most accurate LA reported in the literature, the proofs of their ε-optimal convergence have remained unsolved. Providing these proofs is precisely the intent of this paper. We present a single unifying analysis by which the ε-optimality of both the continuous and discretized schemes is proven. We emphasize that, unlike the ML-based pursuit schemes, the Bayesian schemes have to consider not only the estimates themselves but also the distributional forms of their conjugate posteriors and their higher order moments, all of which render the proofs particularly challenging. As far as we know, apart from the results themselves, the methodologies of this proof have been unreported in the literature; they are both pioneering and novel.
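The difference between an ML estimate and a Bayesian estimate of a reward probability can be illustrated with a minimal sketch, assuming Bernoulli rewards and a Beta(a, b) conjugate prior. The function names and the choice of the posterior mean as the summary statistic are assumptions made for illustration only; the BPA itself may rely on a different statistic of the posterior.

```python
def ml_estimate(rewards: int, pulls: int) -> float:
    """Running maximum likelihood estimate: fraction of rewarded pulls."""
    return rewards / pulls if pulls > 0 else 0.0


def bayes_posterior_mean(rewards: int, pulls: int,
                         a: float = 1.0, b: float = 1.0) -> float:
    """Mean of the Beta(a + rewards, b + failures) posterior.

    Assumes Bernoulli rewards with a Beta(a, b) prior (hypothetical defaults:
    the uniform prior a = b = 1).
    """
    failures = pulls - rewards
    return (a + rewards) / (a + b + rewards + failures)


# After a single rewarded pull the ML estimate jumps straight to 1,
# while the Bayesian estimate stays more cautious.
single_ml = ml_estimate(1, 1)
single_bayes = bayes_posterior_mean(1, 1)
```

With few observations the two estimates differ noticeably; as the number of pulls grows, the prior's influence fades and they approach each other.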
Solving Two-Person Zero-Sum Stochastic Games With Incomplete Information Using Learning Automata With Artificial Barriers
Learning automata (LA) with artificially absorbing barriers were a completely new horizon of research in the 1980s (Oommen, 1986). These new machines yielded properties that were previously unknown. More recently, absorbing barriers have been introduced in continuous estimator algorithms so that the proofs could follow a martingale property, as opposed to monotonicity (Zhang et al., 2014), (Zhang et al., 2015). However, the applications of LA with artificial barriers are almost nonexistent. In that regard, this article is pioneering in that it provides effective and accurate solutions to an extremely complex application domain, namely that of solving two-person zero-sum stochastic games that are provided with incomplete information. LA have been previously used (Sastry et al., 1994) to design algorithms capable of converging to the game's Nash equilibrium under limited information. Those algorithms have focused on the case where the saddle point of the game exists in a pure strategy. However, the majority of the LA algorithms used for games are absorbing in the probability simplex space, and thus, they converge to an exclusive choice of a single action. These LA are thus unable to converge to mixed Nash equilibria when the game possesses no saddle point in pure strategies. The pioneering contribution of this article is that we propose an LA solution that is able to converge to an optimal mixed Nash equilibrium even though there may be no saddle point when a pure strategy is invoked. The scheme, being of the linear reward-inaction (L_{R-I}) paradigm, is in and of itself absorbing. However, by incorporating artificial barriers, we prevent it from being "stuck" or getting absorbed in pure strategies. Unlike the linear reward-ε-penalty (L_{R-εP}) scheme proposed by Lakshmivarahan and Narendra almost four decades ago, our new scheme achieves the same goal with much less parameter tuning and in a more elegant manner.
This article includes the nontrivial proofs of the theoretical results characterizing our scheme and also contains experimental verification that confirms our theoretical findings.
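The idea of keeping a reward-inaction update away from pure strategies can be sketched for a two-action automaton as follows. This is an illustration under stated assumptions, not the scheme analysed in the article: the function name, the barrier value `p_max`, and the exact form of the update (moving towards the barrier instead of towards 1) are all hypothetical.

```python
def lri_barrier_update(p: float, action: int, rewarded: bool,
                       theta: float = 0.1, p_max: float = 0.99) -> float:
    """Update p, the probability of choosing action 0 in a two-action automaton.

    Reward-inaction: the probability changes only when the chosen action is
    rewarded. Assumed barrier form: instead of moving towards 0 or 1, the
    update moves towards p_max or 1 - p_max, so p can never be absorbed
    at a pure strategy.
    """
    if not rewarded:
        return p  # inaction on penalty
    p_min = 1.0 - p_max
    target = p_max if action == 0 else p_min
    return p + theta * (target - p)


# Repeated rewards for action 0 push p towards the barrier, never to 1.
p = 0.5
for _ in range(10_000):
    p = lri_barrier_update(p, action=0, rewarded=True)
```

Because the probability stays strictly inside the simplex, both actions continue to be sampled, which is what allows convergence towards a mixed strategy to be possible at all.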
The Hierarchical Discrete Learning Automaton Suitable for Environments with Many Actions and High Accuracy Requirements
Since its early beginnings, the paradigm of Learning Automata (LA) has attracted much interest. Over the last decades, new concepts and various improvements have been introduced to increase the LA's speed and accuracy, including employing probability updating functions, discretizing the probability space, and implementing the "Pursuit" concept. The concept of incorporating "structure" into the ordering of the LA's actions is one of the latest advancements in the field, leading to the ε-optimal Hierarchical Continuous Pursuit LA (HCPA), which outperforms other LA variants when the number of actions is large. Although the previously proposed HCPA is powerful, its speed is handicapped when the action probability of an action approaches unity. The reason for this slow convergence is that the learning parameter operates in a multiplicative manner within the probability space, making the increment of the action probability smaller as the probability gets close to unity. Therefore, we propose the novel Hierarchical Discrete Learning Automata (HDPA) in this paper, which does not possess the same impediment as the HCPA. The proposed machine infuses the principle of discretization into the action probability vector's updating functionality; this type of updating is invoked recursively at every depth within a hierarchical tree structure, and we pursue the best estimated action in all iterations by utilizing the Estimator phenomenon. The proposed machine is ε-optimal, and our experimental results demonstrate that the number of iterations required before convergence is significantly reduced for the HDPA when compared with the HCPA.
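The slowdown of a multiplicative update near unity, and the way a fixed discretized step avoids it, can be seen in a small sketch. It assumes a single (non-hierarchical) automaton in which the same action is pursued at every iteration; the function names, the learning parameter `lam`, and the `resolution` value are hypothetical and are not tuned to be comparable in accuracy.

```python
def steps_continuous(p0: float = 0.5, lam: float = 0.01,
                     target: float = 0.999) -> int:
    """Iterations for p <- p + lam * (1 - p) to reach the target.

    The increment lam * (1 - p) shrinks as p approaches 1.
    """
    p, n = p0, 0
    while p < target:
        p += lam * (1.0 - p)
        n += 1
    return n


def steps_discretized(p0: float = 0.5, resolution: int = 100,
                      target: float = 0.999) -> int:
    """Iterations for p <- min(1, p + 1/resolution) to reach the target.

    The increment is a constant step, independent of how close p is to 1.
    """
    delta = 1.0 / resolution
    p, n = p0, 0
    while p < target:
        p = min(1.0, p + delta)
        n += 1
    return n


continuous_iterations = steps_continuous()
discretized_iterations = steps_discretized()
```

Under these assumed settings the discretized walk needs far fewer iterations for the final stretch towards unity, which is the effect the abstract attributes to discretization.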
The Hierarchical Discrete Pursuit Learning Automaton: A Novel Scheme With Fast Convergence and Epsilon-Optimality
© 2022 IEEE. Since the early 1960s, the paradigm of learning automata (LA) has experienced abundant interest. Arguably, it has also served as the foundation for the phenomenon and field of reinforcement learning (RL). Over the decades, new concepts and fundamental principles have been introduced to increase the LA's speed and accuracy. These include using probability updating functions, discretizing the probability space, and using the "Pursuit" concept. Very recently, the concept of incorporating "structure" into the ordering of the LA's actions has improved both the speed and accuracy of the corresponding hierarchical machines when the number of actions is large. This has led to the ε-optimal hierarchical continuous pursuit LA (HCPA). This article pioneers the inclusion of all the above-mentioned phenomena into a new single LA, leading to the novel hierarchical discretized pursuit LA (HDPA). Indeed, although the previously proposed HCPA is powerful, its speed has an impediment when any action probability is close to unity, because the updates of the components of the probability vector are correspondingly smaller when any action probability becomes closer to unity. We propose here the novel HDPA, where we infuse the phenomenon of discretization into the action probability vector's updating functionality, which is invoked recursively at every stage of the machine's hierarchical structure. This discretized functionality does not possess the same impediment, because discretization prohibits it. We demonstrate the HDPA's robustness and validity by formally proving its ε-optimality by utilizing the moderation property. We also invoke the submartingale characteristic at every level to prove that the action probability of the optimal action converges to unity as time goes to infinity. Apart from the new machine being ε-optimal, the numerical results demonstrate that the number of iterations required for convergence is significantly reduce...
Utilising policy types for effective ad hoc coordination in multiagent systems
This thesis is concerned with the ad hoc coordination problem. Therein, the goal is
to design an autonomous agent which can achieve high flexibility and efficiency in a
multiagent system that admits no prior coordination between the designed agent and
the other agents. Flexibility describes the agent's ability to solve its task with a variety
of other agents in the system; efficiency is the relation between the agent's payoffs and
time needed to solve the task; and no prior coordination means that the agent does not
a priori know how the other agents behave. This problem is relevant for a number of
practical applications, including human-machine interaction tasks, such as adaptive user
interfaces, robotic elderly care, and automated trading agents.
Motivated by this problem, the central idea studied in this thesis is to utilise a set of
policies, or types, to characterise the behaviour of other agents. Specifically, the idea is
to reduce the complexity of the interaction problem by assuming that the other agents
draw their latent type from some known or hypothesised space of types, and that the
assignment of types is governed by an unknown distribution. Based on the current
interaction history, we can form posterior beliefs about the relative likelihood of types.
These beliefs, combined with the future predictions of the types, can then be used in a
planning procedure to compute optimal responses. The aim of this thesis is to study the
potential and limitations of this idea in the context of ad hoc coordination.
We formulate the ad hoc coordination problem using a game-theoretic model called
the stochastic Bayesian game. Based on this model, we derive a canonical algorithmic
description of the idea outlined above, called Harsanyi-Bellman Ad Hoc Coordination
(HBA). The practical potential of HBA is demonstrated in two case studies, including a
human-machine experiment and a simulated logistics domain. We formulate basic ways
to incorporate evidence (i.e. observed actions) into posterior beliefs and analyse the
conditions under which the posterior beliefs converge to the true distribution of types.
Furthermore, we study the impact of prior beliefs over types (that is, before any actions
are observed) on the long-term performance of HBA, and show empirically that automatic
methods can compute prior beliefs with consistent performance effects. For
hypothesised (i.e. "guessed") type spaces, we analyse the relations between hypothesised
and true type spaces under which HBA is still guaranteed to solve its task, despite
inaccuracies in hypothesised types. Finally, we show how HBA can perform an automatic
statistical analysis to decide whether to reject its behavioural hypothesis, i.e. the
combination of posterior beliefs and types.
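Forming posterior beliefs over types from observed actions can be sketched with a plain Bayesian update. This is only one of the possible ways to incorporate evidence, and it is not claimed to be the exact formulation used in the thesis; the function name and the dictionary-based representation are assumptions for illustration.

```python
def update_type_posterior(prior: dict, likelihoods: dict) -> dict:
    """Return the posterior over types after one observed action.

    prior:       maps each type to its current belief (sums to 1).
    likelihoods: maps each type to the probability that this type would have
                 chosen the observed action, given the interaction history.
    """
    unnormalised = {t: prior[t] * likelihoods[t] for t in prior}
    total = sum(unnormalised.values())
    if total == 0.0:
        # No type explains the observation; keep the prior unchanged (assumed fallback).
        return dict(prior)
    return {t: v / total for t, v in unnormalised.items()}


# Two hypothetical types, initially equally likely.
beliefs = {"A": 0.5, "B": 0.5}
# The observed action is much more likely under type A.
beliefs = update_type_posterior(beliefs, {"A": 0.9, "B": 0.1})
```

Applying the update once per observed action concentrates the belief on the types whose predictions keep matching the observed behaviour.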
Reinforcement Learning
Brains rule the world, and brain-like computation is increasingly used in computers and electronic devices. Brain-like computation is about processing and interpreting data or directly putting forward and performing actions. Learning is a very important aspect. This book is on reinforcement learning, which involves performing actions to achieve a goal. The first 11 chapters of this book describe and extend the scope of reinforcement learning. The remaining 11 chapters show that there is already wide usage in numerous fields. Reinforcement learning can tackle control tasks that are too complex for traditional, hand-designed, non-learning controllers. As learning computers can deal with technical complexities, the remaining task of human operators is to specify goals at increasingly higher levels. This book shows that reinforcement learning is a very dynamic area in terms of theory and applications, and it is intended to stimulate and encourage new research in this field.
Multiobjective in-core fuel management optimisation for nuclear research reactors
Thesis (PhD)--Stellenbosch University, 2016.
ENGLISH SUMMARY : The efficiency and effectiveness of fuel usage in a typical nuclear reactor is influenced by the specific arrangement of available fuel assemblies in the reactor core positions. This arrangement of assemblies is referred to as a fuel reload configuration and usually has to be determined anew for each operational cycle of a reactor. Very often, multiple objectives are pursued simultaneously when designing a reload configuration, especially in the context of nuclear research reactors. In the multiobjective in-core fuel management optimization (MICFMO) problem, the aim is to identify a Pareto optimal set of compromise or trade-off reload configurations. Such a set may then be presented to a decision maker (i.e. a nuclear reactor operator) for consideration so as to select a preferred configuration.
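The notion of a Pareto optimal set of trade-off solutions can be illustrated with a small non-dominated filter. The sketch assumes that every objective is to be minimised and that each candidate is represented by a tuple of objective values; the function names and the sample data are hypothetical and unrelated to any actual reload configuration.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if a is no worse than b in every objective and strictly better in one.

    Assumes all objectives are minimised.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points: list) -> list:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Four hypothetical candidates scored on two objectives.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_front(candidates)
```

Here `(3, 3)` is dropped because `(2, 2)` is better in both objectives, while the remaining three candidates represent different trade-offs that a decision maker would choose between.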
In the first part of this dissertation, a scalarization-based methodology for MICFMO is proposed in order to address several shortcomings associated with the popular weighting method often employed in the literature for solving the MICFMO problem. The proposed methodology has been implemented in a reactor simulation code, called the OSCAR-4 system. In order to demonstrate its practical applicability, the methodology is applied to solve several MICFMO problem instances in the context of two research reactors.
In the second part of the dissertation, an extensive investigation is conducted into the suitability of several multiobjective optimization algorithms for solving the constrained MICFMO problem. The computation time required to perform the investigation is reduced through the usage of several artificial neural networks constructed in the dissertation for objective and constraint function evaluations. Eight multiobjective metaheuristics are compared in the context of a test suite of several MICFMO problem instances, based on the SAFARI-1 research reactor in South Africa. The investigation reveals that the NSGA-II, the P-ACO algorithm and the MOOCEM are generally the best-performing metaheuristics across the problem instances in the test suite, while the MOVNS algorithm also performs well in the context of bi-objective problem instances. As part of this investigation, a multiplicative penalty function (MPF) constraint handling technique is also proposed and compared to an existing constraint handling technique, called constrained-domination. The comparison reveals that the MPF technique is a competitive alternative to constrained-domination.
In an attempt to raise the level of generality at which MICFMO may be performed and potentially improve the quality of optimization results, a multiobjective hyperheuristic, called the AMALGAM method, is also considered in this dissertation. This hyperheuristic incorporates multiple metaheuristic sub-algorithms simultaneously for optimization. Testing reveals that the AMALGAM method yields superior results in the majority of problem instances in the test suite, thus achieving the dual goal of raising the level of generality and of yielding improved optimization results. The method has also been implemented in the OSCAR-4 system and is applied to solve several MICFMO case study problem instances, based on two research reactors, in order to demonstrate its practical applicability.
Finally, in the third part of this dissertation, a conceptual framework is proposed for an optimization-based personal decision support system, dedicated to MICFM. This framework may serve as the basis for developing a computerized tool to aid nuclear reactor operators in designing suitable reload configurations.
AFRIKAANS SUMMARY : The efficiency and effectiveness of fuel consumption in a typical nuclear reactor is influenced by the specific arrangement of available fuel elements in the loading positions of the reactor. This arrangement is known as a fuel reload configuration and is usually determined anew for each operational cycle of a reactor. The simultaneous optimization of multiple objectives is often pursued during the design of a reload configuration, especially in the context of research reactors. The aim of multiobjective in-core fuel management optimization (MICFMO) is to identify a Pareto optimal set of reload configuration trade-offs. Such a set may then be presented for consideration (by, for example, a nuclear reactor operator) so that a preferred configuration can be selected.
In the first part of this dissertation, a scalarization-based methodology for MICFMO is proposed in order to address several shortcomings of the popular weighting method. The latter method is regularly used in the literature to solve the MICFMO problem. The proposed methodology has been implemented in a reactor simulation system, known as the OSCAR-4 system. To demonstrate its practical applicability, the methodology is used to solve a number of MICFMO problem instances in the context of two research reactors.
In the second part of the dissertation, an extensive investigation is conducted to determine the suitability of several multiobjective optimization algorithms for solving the constrained MICFMO problem. The computation time required for the investigation is reduced through the use of artificial neural networks, constructed in the dissertation, to evaluate objective functions and constraints. Eight multiobjective metaheuristics are compared in the context of several MICFMO test problem instances based on the SAFARI-1 research reactor in South Africa. Tests indicate that the NSGA-II, the P-ACO algorithm and the MOOCEM generally perform best across all the test problem instances. The MOVNS algorithm also performs well in the context of bi-objective problem instances. A multiplicative penalty function (MPF) constraint handling technique is also proposed and compared with an existing technique known as constrained-domination. It is found that the MPF technique is a competitive alternative to constrained-domination.
An attempt is made to raise the level of generality at which MICFMO is performed, and potentially to improve the quality of the optimization results. A multiobjective hyperheuristic, known as the AMALGAM method, is considered in pursuit of these two goals. The method works by incorporating a number of metaheuristic sub-algorithms simultaneously. Tests indicate that the AMALGAM method yields better results for the majority of test problems, and thus the two goals mentioned above are achieved. The method has also been implemented in the OSCAR-4 system and is used to solve a number of MICFMO case study problem instances (in the context of two research reactors). In this way, the practical applicability of the method is demonstrated.
Finally, in the third part of the dissertation, a conceptual framework is proposed for an optimization-based personal decision support system aimed at MICFM. This framework may serve as the basis for developing a computerized tool for nuclear reactor operators to design acceptable reload configurations.