26 research outputs found
Toward Data Efficient Online Sequential Learning
Can machines take optimal sequential decisions over time? For decades, researchers have sought an answer to this question, with the ultimate goal of unlocking the potential of artificial general intelligence (AGI) for a better and more sustainable society. Many sectors would benefit from machines able to take efficient sequential decisions over time. Consider real-world applications such as personalized systems in entertainment (content systems), healthcare (personalized therapy), smart cities (traffic control, flood prevention), and robotics (control and planning). However, letting machines take proper decisions in real life is a highly challenging task, because of the uncertainty behind such decisions (uncertainty on the actual reward, on the context, on the environment, etc.). A viable solution is to learn by experience (i.e., by trial and error), letting the machines uncover the uncertainty while taking decisions and refine their strategy accordingly. However, such refinement is usually highly data-hungry (data-inefficient), requiring a large amount of application-specific data and leading to very slow learning processes, hence very slow convergence to optimal strategies (curse of dimensionality). Luckily, data is usually intrinsically structured. Identifying and exploiting such structure substantially improves the data-efficiency of sequential learning algorithms. This is the key hypothesis underpinning the research in this thesis, in which novel structural learning methodologies are proposed for decision-making problems such as Recommendation Systems (RS), Multi-armed Bandits (MAB) and Reinforcement Learning (RL), with the ultimate goal of making the learning process more data-efficient. Specifically, we tackle this goal by modelling the problem structure as graphs, embedding tools from graph signal processing into decision learning theory.
As a first step, we study the application of graph-clustering techniques to RS, in which the curse of dimensionality is addressed by grouping data into clusters. Next, we exploit spectral graph structure for MAB problems, which represent online learning problems. A key challenge is to learn the unknown bandit vector sequentially. Exploiting a smoothness prior (i.e., the bandit vector is smooth on a given underlying graph), we study the Laplacian-regularized estimator and provide both empirical evidence and theoretical analysis of the benefits of exploiting the graph structure in MABs. Then, we focus on the theoretical understanding of the Laplacian-regularized estimator. To this end, we derive a theoretical upper bound on the estimation error, which illustrates the impact of the alignment between the data and the graph structure, as well as of the graph spectrum, on the estimation accuracy.
We then move to RL problems, focusing on the specific problem of learning a proper state-action representation (the representation learning problem). Motivated by the fact that a good representation should be informative of the value function, we seek a learning algorithm able to preserve continuity between the value function and the representation space. Showing that state values are intrinsically correlated with the structure of the state transition dynamics and with the diffusion of the reward on the MDP graph, we build a new loss function based on a newly defined diffusion distance, and we propose a novel method to learn state representations with this desirable property.
In summary, this thesis addresses important online sequential learning problems both theoretically and empirically by leveraging the intrinsic data structure, showing the gain of the proposed solutions toward more data-efficient sequential learning strategies.
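The Laplacian-regularized estimator mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a simplified setting that the abstract does not spell out: each graph node carries one unknown scalar mean, rewards are observed at individual nodes, and the estimate minimises the squared error plus a penalty lam * theta^T L theta. The function names, the path graph and the value of lam are illustrative assumptions, not the thesis's formulation.

```python
def path_laplacian(n):
    """Combinatorial Laplacian L = D - A of an unweighted path graph on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def laplacian_regularized_estimate(observations, L, lam):
    """Minimise sum_t (y_t - theta[a_t])^2 + lam * theta^T L theta.

    observations is a list of (node, reward) pairs. Setting the gradient to
    zero gives (N + lam * L) theta = s, where N is the diagonal matrix of
    per-node observation counts and s holds the per-node reward sums.
    Assumes a connected graph, lam > 0 and at least one observation.
    """
    n = len(L)
    counts = [0.0] * n
    sums = [0.0] * n
    for node, reward in observations:
        counts[node] += 1.0
        sums[node] += reward
    A = [[lam * L[i][j] + (counts[i] if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, sums)


# Only the two end nodes of a 5-node path are observed (illustrative data).
obs = [(0, 0.0), (4, 1.0)]
theta_hat = laplacian_regularized_estimate(obs, path_laplacian(5), lam=1.0)
```

In this toy case the three unobserved middle nodes receive values interpolated between the two observed ends, which is the data-efficiency effect of the smoothness prior: information about one node is shared with its graph neighbours.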
Exploration and exploitation in Bayes sequential decision problems
Bayes sequential decision problems are an extensive problem class with wide application. They involve taking actions in sequence in a system whose characteristics are unknown or only partially known. These characteristics can be learnt over time as a result of our actions. Therefore we are faced with a trade-off between choosing actions that give desirable short-term outcomes (exploitation) and actions that yield useful information about the system which can be used to improve longer-term outcomes (exploration). Gittins indices provide an optimal method for a small but important subclass of these problems. Unfortunately, the optimality of index methods does not hold generally, and Gittins indices can be impractical to calculate for many problems. This has motivated the search for easy-to-calculate heuristics with general application. One such non-index method is the knowledge gradient heuristic. A thorough investigation of the method is made which identifies crucial weaknesses. Index and non-index variants are developed which avoid these weaknesses. Choosing multiple website elements to present to a user is an important problem relevant to many major web-based businesses. A Bayesian multi-armed bandit model is developed which captures the interactions between elements and the dual uncertainties of both user preferences and element quality. The problem has many challenging features, but solution methods are proposed that are both easy to implement and adaptable to particular applications. Finally, easy-to-use software to calculate Gittins indices for Bernoulli and normal rewards has been developed as part of this thesis and has been made publicly available. The methodology used is presented together with a study of accuracy and speed.
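As a rough illustration of the knowledge gradient heuristic named in the abstract above, the sketch below computes a one-step knowledge gradient for a Bernoulli bandit with independent Beta posteriors. This is a common textbook form and is not necessarily the variant studied or developed in the thesis; the function names and the weighting by the number of remaining pulls in `kg_choice` are assumptions made for the example.

```python
def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) posterior over a Bernoulli success rate."""
    return alpha / (alpha + beta)


def knowledge_gradient(arms, i):
    """Expected one-step increase in the best posterior mean from pulling arm i.

    arms is a list of (alpha, beta) Beta posterior parameters, one per arm.
    """
    means = [posterior_mean(a, b) for a, b in arms]
    best_now = max(means)
    # Best mean among the other arms, which one pull of arm i cannot change.
    best_other = max((m for j, m in enumerate(means) if j != i),
                     default=float("-inf"))
    a, b = arms[i]
    p_success = means[i]
    mean_if_success = posterior_mean(a + 1, b)
    mean_if_failure = posterior_mean(a, b + 1)
    expected_best = (p_success * max(best_other, mean_if_success)
                     + (1.0 - p_success) * max(best_other, mean_if_failure))
    return expected_best - best_now


def kg_choice(arms, remaining):
    """Pick the arm maximising immediate mean plus remaining * knowledge gradient.

    remaining (pulls left after this one) is an assumed weighting for the
    value of information; other weightings are possible.
    """
    scores = [posterior_mean(a, b) + remaining * knowledge_gradient(arms, i)
              for i, (a, b) in enumerate(arms)]
    return max(range(len(arms)), key=scores.__getitem__)


# Two arms with the same posterior mean but different amounts of evidence.
arms = [(1, 1), (5, 5)]
kg_uncertain = knowledge_gradient(arms, 0)
kg_well_known = knowledge_gradient(arms, 1)
chosen = kg_choice(arms, remaining=10)
```

Both arms look equally good in the short term, but the less explored arm has the larger knowledge gradient, so the heuristic prefers it when pulls remain. This shows the exploration side of the trade-off described in the abstract.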
Decision making in an uncertain world
Campus Scene; Undated https://egrove.olemiss.edu/phay_laf/1508/thumbnail.jp
Planning Algorithms for Multi-Robot Active Perception
A fundamental task of robotic systems is to use on-board sensors and perception algorithms to understand high-level semantic properties of an environment. These semantic properties may include a map of the environment, the presence of objects, or the parameters of a dynamic field. Observations are highly viewpoint dependent and, thus, the performance of perception algorithms can be improved by planning the motion of the robots to obtain high-value observations. This motivates the problem of active perception, where the goal is to plan the motion of robots to improve perception performance. This fundamental problem is central to many robotics applications, including environmental monitoring, planetary exploration, and precision agriculture. The core contribution of this thesis is a suite of planning algorithms for multi-robot active perception. These algorithms are designed to improve system-level performance on many fronts: online and anytime planning, addressing uncertainty, optimising over a long time horizon, decentralised coordination, robustness to unreliable communication, predicting plans of other agents, and exploiting characteristics of perception models. We first propose the decentralised Monte Carlo tree search algorithm as a generally applicable, decentralised algorithm for multi-robot planning. We then present a self-organising map algorithm designed to find paths that maximally observe points of interest. Finally, we consider the problem of mission monitoring, where a team of robots monitors the progress of a robotic mission. A spatiotemporal optimal stopping algorithm is proposed, along with a generalisation for decentralised monitoring. Experimental results are presented for a range of scenarios, such as marine operations and object recognition. Our analytical and empirical results demonstrate theoretically interesting and practically relevant properties that support the use of the approaches in practice.
Distributed Planning for Self-Organizing Production Systems
Automated production plants face a fundamental trade-off between efficiency
and flexibility. In most cases, the workflows are fixed not only by the
physical layout of the plant but also by the specially tailored programming of
the plant control system. Changes have to be laboriously carried through in a
large number of systems. This makes the production of small lot sizes
unprofitable.
This dissertation develops an approach for automatically adapting the
behaviour of production plants to changing orders and operating conditions. It
applies the principle of self-organization through distributed planning. The
results of the dissertation, which build on one another, are as follows:
1. A model of production plants is developed that scales seamlessly from the
detailed consideration of physical production processes up to supply
relationships between companies. Compared with existing models of production
plants, fewer limiting assumptions are made. In this sense, the modelling
approach is a candidate for the frequently called-for "theory of production".
2. For the scenarios modelled in this way, an algorithm for optimizing the
concurrent processes is developed. The algorithm combines techniques for
combinatorial and continuous optimization: depending on the level of detail
and the design of the modelled scenario, the same algorithm can perform
combinatorial detailed production scheduling, optimize worldwide supply
relationships while taking uncertainty and risk into account, and control
physical processes predictively. To this end, techniques from Monte Carlo tree
search (which are also used in DeepMind's AlphaGo) are developed further. By
exploiting additional structure in the models, the approach also scales to
large scenarios.
3. The planning algorithm is transferred to distributed optimization by
independent agents. For this purpose, so-called "utility propagation" is
developed as a coordination mechanism. It is inspired by belief propagation
for inference in probabilistic graphical models. Each participating agent has
a local action space in which it can observe the system state and intervene by
acting. The agents are interested in maximizing the total welfare across all
agents. The cooperation this requires arises from the exchange of messages
between neighbouring agents. The messages describe the expected utility of an
assumed behaviour in the action space of both agents.
4. A description of the reusable skills of machines and plants is developed on
the basis of formal description logics. Starting from the described skills and
from the pending orders with their required production steps, executable
actions are derived. The executable actions, with well-defined preconditions
and effects, encapsulate the required parameterizations, programmed sequences
and the synchronization of machines at run time.
Taken together, the results lay foundations for flexible automated production
systems (within a single factory hall, but also distributed across sites and
organizations) that can make targeted use of their inherent degrees of freedom
through run-time planning and agent-based coordination. The link to practice
is established through application examples. The feasibility of the approach
was demonstrated with real machines as part of the EU project SkillPro and in
a simulation environment with further scenarios.
Delays in Reinforcement Learning
Delays are inherent to most dynamical systems. Besides shifting the process
in time, they can significantly affect its performance. For this reason, it
is usually valuable to study the delay and account for it. Because they are
dynamical systems, it is no surprise that sequential decision-making
problems such as Markov decision processes (MDP) can also be affected by
delays. These processes are the foundational framework of reinforcement
learning (RL), a paradigm whose goal is to create artificial agents capable of
learning to maximise their utility by interacting with their environment.
RL has achieved strong, sometimes astonishing, empirical results, but delays
are seldom explicitly accounted for. The understanding of the impact of delay
on the MDP is limited. In this dissertation, we propose to study the delay in
the agent's observation of the state of the environment or in the execution of
the agent's actions. We will repeatedly change our point of view on the problem
to reveal some of its structure and peculiarities. A wide spectrum of delays
will be considered, and potential solutions will be presented. This
dissertation also aims to draw links between celebrated frameworks of the RL
literature and that of delays.
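To make the notion of an observation delay concrete, here is a minimal sketch under stated assumptions: a constant delay of a fixed number of steps, a hypothetical toy environment (`CounterEnv`), and an augmented state made of the delayed observation plus the actions taken since. This construction is common in the delayed-MDP literature, but the abstract does not say which solutions the dissertation itself proposes, and all names below are illustrative.

```python
from collections import deque


class CounterEnv:
    """Hypothetical toy environment: the state is an integer shifted by each action."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action
        return self.state


class ObservationDelay:
    """Show the agent the state from `delay` steps ago plus the actions taken since."""

    def __init__(self, env, delay):
        self.env = env
        self.delay = delay
        # Holds the last delay + 1 states; the oldest entry is what the agent sees.
        self.pending_states = deque([env.state] * (delay + 1), maxlen=delay + 1)
        # Holds the last `delay` actions, whose effects are not yet observed.
        self.recent_actions = deque(maxlen=delay)

    def step(self, action):
        self.pending_states.append(self.env.step(action))
        self.recent_actions.append(action)
        delayed_observation = self.pending_states[0]
        return delayed_observation, tuple(self.recent_actions)


env = CounterEnv()
delayed = ObservationDelay(env, delay=2)
history = [delayed.step(a) for a in (1, 1, -1)]
last_observation, unobserved_actions = history[-1]
```

In this deterministic toy example the delayed observation together with the unobserved actions is enough to reconstruct the true current state, which is why the augmented state restores the Markov property that the delayed observation alone lacks.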