48 research outputs found

    Utility-based Reinforcement Learning for Reactive Grids

    Large-scale production grids are an important case for autonomic computing. They follow a mutualization paradigm: decision-making (human or automatic) is distributed and largely independent, yet it must implement the high-level goals of grid management. This paper deals with the scheduling problem under two partially conflicting goals: fair share and Quality of Service (QoS). Fair sharing is a well-known issue motivated by return on investment for participating institutions. Differentiated QoS has emerged as an important and unexpected requirement in the current usage of production grids. In the framework of the EGEE grid (one of the largest existing grids), applications from diverse scientific communities require a pseudo-interactive response time. More generally, seamless integration of grid power into everyday use calls for unplanned and interactive access to grid resources, which defines reactive grids. The major result of this paper is that the combination of utility functions and reinforcement learning (RL) provides a general and efficient method for dynamically allocating grid resources in order to satisfy both end users with differentiated requirements and participating institutions. Combining RL methods and utility functions for resource allocation was pioneered by Tesauro and Vengerov. While the application contexts are different, the resource allocation issues are very similar. The main difference in our work is that we consider a multi-criteria optimization problem that includes a fair-share objective. A first contribution of our work is the definition of a set of variables describing states and actions that allows us to formulate the grid scheduling problem as a continuous action-state space reinforcement learning problem. To capture the immediate goals of end users and the long-term objectives of administrators, we propose automatically derived utility functions. Finally, our experimental results on a synthetic workload and a real EGEE trace show that RL clearly outperforms the classical schedulers, so it is a realistic alternative to empirical scheduler design.
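The abstract above does not give the utility functions themselves. As a rough illustration of how a responsiveness goal and a fair-share goal might be folded into a single reward signal for an RL scheduler, here is a minimal Python sketch. Every name, formula, and the weighting scheme below is a hypothetical assumption made for illustration, not the definitions used in the paper.

```python
def responsiveness_utility(wait_time, run_time):
    """Hypothetical QoS utility in (0, 1]: 1.0 when a job starts at once,
    decreasing as the wait grows relative to the run time."""
    return run_time / (run_time + wait_time)


def fairshare_utility(allocated, target):
    """Hypothetical fairness utility in [0, 1]: 1 minus the largest
    relative shortfall of any group with respect to its target share."""
    worst = 0.0
    for group, share in target.items():
        shortfall = max(0.0, share - allocated.get(group, 0.0)) / share
        worst = max(worst, shortfall)
    return 1.0 - worst


def reward(wait_time, run_time, allocated, target, weight=0.5):
    """Assumed scalar reward: a weighted sum of the two utilities."""
    qos = responsiveness_utility(wait_time, run_time)
    fairness = fairshare_utility(allocated, target)
    return weight * qos + (1.0 - weight) * fairness


if __name__ == "__main__":
    target = {"a": 0.5, "b": 0.5}
    print(reward(10, 10, {"a": 1.0, "b": 0.0}, target))
```

A weighted sum is only one way to trade off the two objectives; the choice of `weight` here is arbitrary.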

    La surveillance efficace de bout-à-bout pour la gestion des pannes dans les systèmes distribués [Efficient end-to-end monitoring for fault management in distributed systems]

    In this dissertation, we present our work on fault management in distributed systems, motivated mainly by the monitoring of faults and abrupt changes in large computing systems such as the grid and the cloud. Instead of building complete a priori knowledge of the software and hardware infrastructures, as in conventional detection or diagnosis methods, we propose to use appropriate techniques to perform end-to-end monitoring of such large-scale systems, leaving the inaccessible details of the components involved in a black box. For the fault monitoring of a distributed system, we first model this probe-based application as a static collaborative prediction (CP) task, and experimentally demonstrate the effectiveness of CP methods using the max-margin matrix factorization method. We further introduce active learning into the CP framework and show its critical advantage in dealing with highly imbalanced data, which is especially useful for identifying the minority fault class. We then extend static fault monitoring to the sequential case by proposing the sequential matrix factorization (SMF) method. SMF takes a sequence of partially observed matrices as input and produces predictions that draw on information from both the current and past time windows. Active learning is also applied to SMF, so that highly imbalanced data can be handled properly. In addition to the sequential methods, smoothing the estimation sequence has been shown to be a practically useful trick for enhancing sequential prediction performance. Since the stationarity assumption used in static and sequential fault monitoring becomes unrealistic in the presence of abrupt changes, we propose a semi-supervised online change detection (SSOCD) framework to detect intended changes in time series data. In this way, the static model of the system can be recomputed once an abrupt change is detected. In SSOCD, an unsupervised offline method is proposed to analyze a sample data series. The change points thus detected are used to train a supervised online model, which gives an online decision about whether a change is present in the arriving data sequence. State-of-the-art change detection methods are employed to demonstrate the usefulness of the framework. All presented work is verified on real-world datasets. Specifically, the fault monitoring experiments are conducted on a dataset collected from the Biomed grid infrastructure within the European Grid Initiative, and the abrupt change detection framework is verified on a dataset concerning the performance change of an online site with a large amount of traffic.
    PARIS11-SCD-Bib. électronique (914719901) / Sudoc, France
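The abstract mentions that smoothing the sequence of estimates helps sequential prediction, but it does not say which smoothing is used. The Python sketch below assumes simple exponential smoothing purely for illustration; the function name `smooth_estimates` and the parameter `alpha` are hypothetical.

```python
def smooth_estimates(estimates, alpha=0.5):
    """Exponentially smooth a sequence of per-window prediction scores.

    alpha is the weight given to the newest estimate (an assumed
    parameter, not one taken from the dissertation).
    """
    smoothed = []
    previous = None
    for value in estimates:
        if previous is None:
            previous = value  # the first window has no history to blend with
        else:
            previous = alpha * value + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed


if __name__ == "__main__":
    # Scores for one probe over three consecutive time windows.
    print(smooth_estimates([1.0, 0.0, 1.0]))
```

A smaller `alpha` damps isolated spikes more strongly, at the cost of reacting more slowly to genuine changes.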

    The Green Computing Observatory: a data curation approach for green IT

    The first barrier to improved energy efficiency is the difficulty of collecting data on the energy consumption of individual data center components, and the lack of overall data collection. The Green Computing Observatory (GCO) collects monitoring data on the energy consumption of a large computing center and publishes them through the Grid Observatory. These data include detailed monitoring of the processors and motherboards, as well as global site information such as overall consumption and overall cooling. A second barrier is making the collected data usable. The difficulty is to make the data consistent and complete, as well as understandable for further exploitation. For this purpose, GCO takes an ontological approach in order to rigorously define the semantics of the data (what is measured) and the context of their production (how they are acquired and/or calculated). GCO addresses these issues within the framework of a production infrastructure dedicated to e-science, providing a unique facility for the Computer Science and Engineering community. The overall goal is to create a full-fledged data curation process. This paper reports on the first achievements, specifically acquisition and ontology.

    Grid Differentiated Services: a Reinforcement Learning Approach

    Large-scale production grids are a major case for autonomic computing. Following the classical definition of Kephart, an autonomic computing system should optimize its own behavior in accordance with high-level guidance from humans. The central tenet of this paper is that the combination of utility functions and reinforcement learning (RL) can provide a general and efficient method for dynamically allocating grid resources in order to optimize the satisfaction of both end users and participating institutions. The flexibility of an RL-based system makes it possible to model the state of the grid, the jobs to be scheduled, and the high-level objectives of the various actors on the grid. RL-based scheduling can seamlessly adapt its decisions to changes in the distributions of inter-arrival time, QoS requirements, and resource availability. Moreover, it requires minimal prior knowledge about the target environment, including user requests and infrastructure. Our experimental results, both on a synthetic workload and a real trace, show that RL is not only a realistic alternative to empirical scheduler design, but is able to outperform it.

    ABSTRACT

    Result checking is the theory and practice of proving that the result of an execution of a program on an input is correct. Result checking has most often been envisioned in the framework of program testing or property testing, where the issue is the conformity of the program to some a priori specification. Very large-scale distributed computing systems also require tackling the issue of computation correctness, albeit from hypotheses very different from those of program testing. The general issues examined in this paper are the following. First, the definition of checking methods adapted to large-scale Monte Carlo simulations; for these applications, no external criterion can be used to assess the quality of the result. Second, two result-checking algorithms that minimize the overall overhead through an adaptive strategy. Finally, a specialization of this framework to a case study, the Auger astrophysics experiment. Our main contributions are: first, to focus on checking Monte Carlo simulations, which have rarely been considered previously; second, to define a probabilistic checking strategy that is compatible with Byzantine saboteurs and that includes the risk of the first kind (false positive) as well as the risk of the second kind (false negative), which is usually the only one considered; third, to exploit the probable characteristics of the saboteurs' behaviour to optimise for the most frequent case. Finally, we show on a case study that the implementation details can be carried out.

    Java-Based Coupling for Parallel Predictive-Adaptive Domain Decomposition

    Adaptive domain decomposition exemplifies the problem of integrating heterogeneous software components with intermediate coupling granularity. This paper describes an experiment where a data‐parallel (HPF) client interfaces with a sequential computation server through Java. We show that seamless integration of data‐parallelism is possible, but requires most of the tools from the Java palette: Java Native Interface (JNI), Remote Method Invocation (RMI), callbacks and threads

    Characterizing E-Science File Access Behavior via Latent Dirichlet Allocation

    E-science is moving from grids to clouds. Getting the best of both worlds requires building on the experience gained from the steady operation of production grids over several years. With the Grid Observatory initiative, trace data are publicly available to the computer science and engineering community and can be used for dimensioning and optimizing infrastructure. This paper proposes a new approach for analyzing behavioral traces: as most of them are in fact text documents, state-of-the-art techniques in text mining, and specifically Latent Dirichlet Allocation, can be exploited. The advantages are twofold: some level of explanation inferred from the data, and a relatively scalable way to capture the temporal variability of the behavior of interest while retaining the full dimensionality of the problem at hand. We test the text-mining analogy by characterizing file access behavior. We validate the resulting probabilistic model by showing that it is capable of generating synthetic traces statistically consistent with the real ones.
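The abstract says the fitted model can generate synthetic traces. As a rough illustration of what sampling from a Latent Dirichlet Allocation model looks like when "words" are file names, the Python sketch below follows the standard LDA generative process. The topic contents, file names, and parameter values are made up for the example, and the paper's actual mapping from traces to documents may differ.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_trace(topics, alpha, length, rng):
    """Generate one synthetic access trace (one LDA 'document').

    topics: list of dicts mapping file name -> probability (hypothetical).
    alpha:  Dirichlet prior over topics, one positive value per topic.
    """
    theta = sample_dirichlet(alpha, rng)  # topic mixture for this trace
    trace = []
    for _ in range(length):
        # Pick a topic for this access, then a file from that topic.
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        files = list(topics[topic])
        weights = [topics[topic][name] for name in files]
        trace.append(rng.choices(files, weights=weights)[0])
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    topics = [
        {"a.root": 0.7, "b.root": 0.3},  # made-up "analysis" topic
        {"c.dat": 1.0},                  # made-up "calibration" topic
    ]
    print(generate_trace(topics, alpha=[1.0, 1.0], length=10, rng=rng))
```

In practice the topic distributions would come from a model fitted to real traces rather than being written by hand.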

    Apprentissage artificiel pour l'ordonnancement des tâches dans les grilles de calcul [Machine learning for job scheduling in computing grids]

    ORSAY-PARIS 11-BU Sciences (914712101) / Sudoc, France

    The convergence of clouds, grids, and autonomics

    This excerpt reports on a panel that took place in conjunction with the 2009 IEEE International Conference on Autonomic Computing. The full panel report, including the panelists' recommendations, is available for free on Computing Now (http://computingnow.computer.org/panel)