
48 research outputs found

Utility-based Reinforcement Learning for Reactive Grids

Author: Germain-Renaud Cécile
Kégl Balázs
Loomis C.
Perez Julien
Publication venue: HAL CCSD
Publication date: 01/05/2008

Large-scale production grids are an important case for autonomic computing. They follow a mutualization paradigm: decision-making (human or automatic) is distributed and largely independent, and, at the same time, it must implement the high-level goals of grid management. This paper deals with the scheduling problem under two partially conflicting goals: fair share and Quality of Service (QoS). Fair sharing is a well-known issue motivated by return on investment for participating institutions. Differentiated QoS has emerged as an important and unexpected requirement in the current usage of production grids. In the framework of the EGEE grid (one of the largest existing grids), applications from diverse scientific communities require a pseudo-interactive response time. More generally, seamless integration of grid power into everyday use calls for unplanned and interactive access to grid resources, which defines reactive grids. The major result of this paper is that the combination of utility functions and reinforcement learning (RL) provides a general and efficient method for dynamically allocating grid resources in order to satisfy both end users with differentiated requirements and participating institutions. Combining RL methods and utility functions for resource allocation was pioneered by Tesauro and Vengerov. While the application contexts are different, the resource allocation issues are very similar. The main difference in our work is that we consider a multi-criteria optimization problem that includes a fair-share objective. A first contribution of our work is the definition of a set of variables describing states and actions that allows us to formulate the grid scheduling problem as a continuous action-state space reinforcement learning problem. To capture the immediate goals of end users and the long-term objectives of administrators, we propose automatically derived utility functions. Finally, our experimental results on a synthetic workload and a real EGEE trace show that RL clearly outperforms the classical schedulers, so it is a realistic alternative to empirical scheduler design.
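To make the idea of combining a user-level utility with a fair-share utility more concrete, here is a minimal sketch of how the two could be merged into a single scalar reward for an RL scheduler. The function names, the specific utility formulas, and the `weight` parameter are all assumptions chosen for illustration; the abstract does not give the paper's actual definitions.

```python
def responsiveness_utility(wait_time, run_time):
    """Hypothetical user utility in (0, 1]: equals 1 when the job starts at once.

    Assumes run_time > 0 and wait_time >= 0.
    """
    return run_time / (run_time + wait_time)


def fairshare_utility(allocated, target):
    """Hypothetical fairness utility in [0, 1]: equals 1 when shares match targets.

    Both arguments are share vectors that each sum to 1; the penalty is the
    total variation distance between them.
    """
    distance = 0.5 * sum(abs(a - t) for a, t in zip(allocated, target))
    return 1.0 - distance


def reward(wait_time, run_time, allocated, target, weight=0.5):
    """Weighted sum of the two utilities (weight is an illustrative assumption)."""
    user_part = responsiveness_utility(wait_time, run_time)
    fair_part = fairshare_utility(allocated, target)
    return weight * user_part + (1.0 - weight) * fair_part
```

For instance, a job that starts immediately on a grid whose allocated shares match the targets exactly gets a reward of 1.0, and any waiting time or share imbalance lowers that value.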

HAL-CentraleSupelec

HAL-IN2P3

INRIA a CCSD electronic archive server

HAL-Rennes 1

Efficient end-to-end monitoring for fault management in distributed systems (La surveillance efficace de bout-à-bout pour la gestion des pannes dans les systèmes distribués)

Author: FENG Dawei
GERMAIN-RENAUD Cécile
Publication date: 01/01/2014

In this dissertation, we present our work on fault management in distributed systems, motivated by the monitoring of faults and abrupt changes in large computing systems such as the grid and the cloud. Instead of building complete a priori knowledge of the software and hardware infrastructures, as conventional detection or diagnosis methods do, we propose to use appropriate techniques to perform end-to-end monitoring of such large-scale systems, leaving the inaccessible details of the components involved in a black box.
For fault monitoring of a distributed system, we first model this probe-based application as a static collaborative prediction (CP) task, and experimentally demonstrate the effectiveness of CP methods using the max-margin matrix factorization method. We further introduce active learning into the CP framework and show its critical advantage in dealing with highly imbalanced data, which is especially useful for identifying the minority fault class.
We then extend static fault monitoring to the sequential case by proposing the sequential matrix factorization (SMF) method. SMF takes a sequence of partially observed matrices as input and produces predictions that use information from both the current and past time windows. Active learning is also applied to SMF, so that highly imbalanced data can be handled properly. In addition to the sequential methods, a smoothing action applied to the estimation sequence has proved to be a practically useful trick for improving sequential prediction performance.
Since the stationarity assumption used in static and sequential fault monitoring becomes unrealistic in the presence of abrupt changes, we propose a semi-supervised online change detection (SSOCD) framework to detect intended changes in time series data. In this way, the static model of the system can be recomputed once an abrupt change is detected. In SSOCD, an unsupervised offline method is proposed to analyze a sample data series. The change points thus detected are used to train a supervised online model, which decides online whether a change is present in the arriving data sequence. State-of-the-art change detection methods are employed to demonstrate the usefulness of the framework.
All presented work is verified on real-world datasets. Specifically, the fault monitoring experiments are conducted on a dataset collected from the Biomed grid infrastructure within the European Grid Initiative, and the abrupt change detection framework is verified on a dataset concerning the performance change of an online site with a large amount of traffic.
PARIS11-SCD-Bib. électronique (914719901) / Sudoc / France
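The smoothing of an estimation sequence mentioned in this abstract can be illustrated with a short sketch. The abstract does not say which smoothing is used, so an exponential moving average is assumed here purely for illustration; the function name `smooth` and the parameter `alpha` are hypothetical.

```python
def smooth(estimates, alpha=0.5):
    """Exponentially smooth a sequence of per-window estimates.

    alpha in (0, 1] weights the newest estimate; the first value is kept as-is.
    This is an assumed smoothing rule, not the one used in the thesis.
    """
    smoothed = []
    for value in estimates:
        if not smoothed:
            smoothed.append(value)
        else:
            smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


print(smooth([1.0, 0.0, 1.0]))  # [1.0, 0.5, 0.75]
```

A single outlying estimate in one time window is damped by the history, which is one plausible reason such a step helps sequential prediction.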

OpenGrey Repository

The Green Computing Observatory: a data curation approach for green IT

Author: Furst Frederic
Germain-Renaud Cécile
Jouvin Michel
Kassel Gilles
Nauroy Julien
Philippon Guillaume
Publication venue: HAL CCSD
Publication date: 01/12/2011

The first barrier to improved energy efficiency is the difficulty of collecting data on the energy consumption of individual components of data centers, and the lack of overall data collection. The Green Computing Observatory (GCO) collects monitoring data on the energy consumption of a large computing center and publishes them through the Grid Observatory. These data include detailed monitoring of the processors and motherboards, as well as global site information, such as overall consumption and overall cooling. A second barrier is making the collected data usable. The difficulty is to make the data readily consistent and complete, as well as understandable for further exploitation. For this purpose, GCO opts for an ontological approach in order to rigorously define the semantics of the data (what is measured) and the context of their production (how they are acquired and/or calculated). GCO addresses these issues within the framework of a production infrastructure dedicated to e-science, providing a unique facility for the Computer Science and Engineering community. The overall goal is to create a full-fledged data curation process. This paper reports on the first achievements, specifically acquisition and ontology.

HAL-CentraleSupelec

HAL-IN2P3

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Rennes 1

Grid Differentiated Services: a Reinforcement Learning Approach

Author: Germain-Renaud Cécile
Kégl Balázs
Loomis C.
Perez Julien
Publication venue: HAL CCSD
Publication date: 01/05/2008

Large-scale production grids are a major case for autonomic computing. Following the classical definition of Kephart, an autonomic computing system should optimize its own behavior in accordance with high-level guidance from humans. The central tenet of this paper is that the combination of utility functions and reinforcement learning (RL) can provide a general and efficient method for dynamically allocating grid resources in order to optimize the satisfaction of both end users and participating institutions. The flexibility of an RL-based system makes it possible to model the state of the grid, the jobs to be scheduled, and the high-level objectives of the various actors on the grid. RL-based scheduling can seamlessly adapt its decisions to changes in the distributions of inter-arrival time, QoS requirements, and resource availability. Moreover, it requires minimal prior knowledge about the target environment, including user requests and infrastructure. Our experimental results, both on a synthetic workload and a real trace, show that RL is not only a realistic alternative to empirical scheduler design, but is able to outperform it.

HAL-CentraleSupelec

HAL-IN2P3

INRIA a CCSD electronic archive server

HAL-Rennes 1

ABSTRACT

Author: Cécile Germain-Renaud

Result checking is the theory and practice of proving that the result of an execution of a program on an input is correct. Result checking has most often been envisioned in the framework of program testing or property testing, where the issue is the conformity of the program to some a priori specification. Very large-scale distributed computing systems require tackling the issue of computation correctness, albeit from hypotheses very different from those of program testing. The general issues examined in this paper are the following. First, the definition of checking methods adapted to large-scale Monte-Carlo simulations; for these applications, no external criterion can be used to assess the quality of the result. Second, two result checking algorithms which minimize the overall overhead through an adaptive strategy. Finally, a specialization of this framework to a case study, the Auger astrophysics experiment. Our main contributions are: first, to focus on checking Monte-Carlo simulations, which have rarely been considered previously; second, to define a probabilistic checking strategy that is compatible with Byzantine saboteurs and includes the risk of the first kind (false positive) as well as the risk of the second kind (false negative), which is usually the only one considered; third, to exploit the probable characteristics of the behaviour of the saboteurs to optimise for the most frequent case. Finally, we show on a case study that the implementation details can be carried out.
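As an illustration of a checking strategy that bounds both kinds of risk and adapts the number of checks to the evidence seen so far, the sketch below uses Wald's sequential probability ratio test. This is a classical technique swapped in for illustration; the abstract does not say which test the paper uses. The function name and the parameters `p0` (tolerated failure rate of an honest worker) and `p1` (failure rate attributed to a saboteur) are assumptions.

```python
import math


def sprt(check_failures, p0, p1, alpha, beta):
    """Wald's sequential probability ratio test on a stream of check outcomes.

    check_failures: iterable of booleans, True when a spot check failed.
    p0, p1: assumed failure rates under "honest" and "saboteur", 0 < p0 < p1 < 1.
    alpha: accepted risk of rejecting an honest worker (false positive).
    beta: accepted risk of accepting a saboteur (false negative).
    Returns (decision, number_of_checks_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing it rejects the worker
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts the worker
    log_ratio = 0.0
    used = 0
    for failed in check_failures:
        used += 1
        if failed:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p0))
        if log_ratio >= upper:
            return "reject", used
        if log_ratio <= lower:
            return "accept", used
    return "undecided", used
```

The test stops as soon as the evidence is strong enough in either direction, so clearly honest or clearly faulty workers cost few checks, which matches the spirit of an adaptive strategy that keeps overhead low.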

CiteSeerX

Java-Based Coupling for Parallel Predictive-Adaptive Domain Decomposition

Author: Cécile Germain‐Renaud
Vincent Néri
Publication venue: Hindawi Limited
Publication date: 01/01/1999

Adaptive domain decomposition exemplifies the problem of integrating heterogeneous software components with intermediate coupling granularity. This paper describes an experiment where a data‐parallel (HPF) client interfaces with a sequential computation server through Java. We show that seamless integration of data‐parallelism is possible, but requires most of the tools from the Java palette: Java Native Interface (JNI), Remote Method Invocation (RMI), callbacks, and threads.

Directory of Open Access Journals

Characterizing E-Science File Access Behavior via Latent Dirichlet Allocation

Author: Cécile Germain-Renaud
Yusik Kim
Publication date: 01/01/2011

E-science is moving from grids to clouds. Getting the best of both worlds requires building on the experience gained from the steady operation of production grids over several years. With the Grid Observatory initiative, trace data are publicly available to the computer science and engineering community and can be used for dimensioning and optimizing infrastructure. This paper proposes a new approach for analyzing behavioral traces: as most of them are indeed text documents, state-of-the-art techniques in text mining, and specifically Latent Dirichlet Allocation, can be exploited. The advantages are twofold: providing some level of explanation inferred from the data, and a relatively scalable way to capture the temporal variability of the behavior of interest while retaining the full dimensionality of the problem at hand. We test the text mining analogy by characterizing file access behavior. We validate the resulting probabilistic model by showing that it is capable of generating synthetic traces statistically consistent with the real ones.
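The text mining analogy can be sketched as follows: each group of trace events plays the role of a document, and each accessed file plays the role of a word, giving the bag-of-words counts that a topic model such as Latent Dirichlet Allocation takes as input. The grouping key (for example a user or a time window) and the event layout are assumptions made for illustration; the abstract does not specify how the paper builds its documents, and the topic model itself is not shown.

```python
from collections import Counter, defaultdict


def traces_to_documents(events):
    """Group file-access events into bag-of-words "documents".

    events: iterable of (document_key, file_name) pairs; the key is a
    hypothetical grouping such as a user id or a time window.
    Returns {document_key: Counter({file_name: access_count})}.
    """
    documents = defaultdict(Counter)
    for key, file_name in events:
        documents[key][file_name] += 1
    return dict(documents)


sample = [("u1", "a.root"), ("u1", "a.root"), ("u1", "b.root"), ("u2", "a.root")]
print(traces_to_documents(sample)["u1"]["a.root"])  # 2
```

The resulting per-document counts can then be handed to any LDA implementation, with topics read as recurring groups of files that tend to be accessed together.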

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

Machine learning for job scheduling in computing grids (Apprentissage artificiel pour l'ordonnancement des tâches dans les grilles de calcul)

Author: GERMAIN-RENAUD Cécile
PEREZ Julien
Publication date: 01/01/2010

ORSAY-PARIS 11-BU Sciences (914712101) / Sudoc / France

OpenGrey Repository

The convergence of clouds, grids, and autonomics

Author: Germain-Renaud Cécile
Rana Omer Farooq
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2009

This excerpt reports on a panel that took place in conjunction with the 2009 IEEE International Conference on Autonomic Computing. The full panel report, including the panelists' recommendations, is available for free on Computing Now (http://computingnow.computer.org/panel)

Online Research @ Cardiff