    Online Testing of Federated and Heterogeneous Distributed Systems

    Enforcing CPU allocation in a heterogeneous IaaS

    In an Infrastructure as a Service (IaaS), the amount of resources allocated to a virtual machine (VM) at creation time may be expressed with relative values (relative to the hardware, i.e., a fraction of the capacity of a device) or absolute values (i.e., a performance metric which is independent of the capacity of the hardware). Surprisingly, disk and network resource allocations are expressed with absolute values (bandwidth), but CPU resource allocations are expressed with relative values (a percentage of a processor). The major problem with relative CPU allocations is that they depend on the capacity of the CPU, which may vary due to different factors (server heterogeneity in a cluster, Dynamic Voltage and Frequency Scaling (DVFS)). In this paper, we analyze the side effects and drawbacks of relative allocations. We claim that CPU allocation should be expressed with absolute values. We propose such a CPU resource management system, and we demonstrate and evaluate its benefits.
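The difference between relative and absolute CPU allocations can be illustrated with a small worked example. This is only a sketch under stated assumptions, not the system proposed in the paper: the function name `relative_share`, the abstract "capacity units", and the idea that capacity scales linearly with clock frequency are all hypothetical simplifications introduced here.

```python
def relative_share(booked_units: float, host_capacity_units: float) -> float:
    """Fraction of one processor needed to deliver an absolute CPU booking.

    Both arguments are in the same hypothetical "capacity units"
    (an assumption of this sketch, e.g. proportional to clock frequency).
    """
    if booked_units < 0 or host_capacity_units <= 0:
        raise ValueError("booking must be >= 0 and capacity must be > 0")
    share = booked_units / host_capacity_units
    if share > 1.0:
        # The host is too slow to honour the absolute booking on one processor.
        raise ValueError("host capacity is below the booked amount")
    return share


if __name__ == "__main__":
    booked = 1200.0  # absolute booking, identical on every host
    # Hypothetical capacities: two different servers, plus the first
    # server again after DVFS has lowered its frequency.
    hosts = {
        "server A (nominal)": 2400.0,
        "server B (nominal)": 3000.0,
        "server A (DVFS, scaled down)": 1600.0,
    }
    for name, capacity in hosts.items():
        share = relative_share(booked, capacity)
        print(f"{name}: {share:.0%} of one processor")
```

With these made-up numbers, the same absolute booking corresponds to 50%, 40%, and 75% of a processor. A fixed relative allocation of 50% would instead deliver a different amount of computing capacity on each of the three hosts, which is the drawback the abstract describes.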

    Scheduler performance and quality of service in a virtualized environment

    Faced with the rising cost of deploying and maintaining computer systems, companies are turning to outsourcing solutions such as cloud computing. The cloud relies on virtualization as its main technology for resource pooling. Virtualization brings many challenges, chief among them the performance of applications running in virtual machines (VMs) and the predictability of that performance. In a virtualized system, hardware resources are shared among all the VMs in the system. For the CPU, the hypervisor's scheduler is responsible for sharing it among all the virtual processors (vCPUs) of the VMs. The hypervisor time-shares the CPU among all the vCPUs, so each vCPU gets access to the CPU periodically. The vCPUs therefore access the CPU intermittently rather than continuously. This discontinuity causes many problems for mechanisms such as interrupt handling and low-level synchronization in guest OSes. In this thesis, we propose two contributions that address these problems in virtualization. The first is a new hypervisor scheduler that dynamically adapts the quantum value in the hypervisor to the type of applications running in the VMs on a multi-core platform. The second is a new synchronization primitive (named I-Spinlock) in the guest OS. In a cloud providing an IaaS service, the VM is the unit of allocation. The provider establishes a catalog of VM types listing the amounts of resources allocated to a VM for the various devices. The resources allocated to a VM correspond to a quality-of-service contract negotiated by the customer with the provider. 
Unpredictable performance is the consequence of the provider's inability to guarantee this quality of service. Two main causes are behind this problem in the cloud: (i) poor resource sharing among the VMs and (ii) the heterogeneity of the infrastructure in hosting centers. In this thesis, we propose two contributions that address the problem of performance unpredictability. The first concerns the sharing of the software resource responsible for managing drivers, and proposes an approach for charging the CPU time used by this software layer to the VMs. The second concerns CPU allocation in heterogeneous clouds. In this contribution, we propose an allocation approach that guarantees the computing capacity allocated to a VM regardless of the heterogeneity of the CPUs in the infrastructure.

    An Integrated Framework for Improving the Quality and Reliability of Software Upgrades

    Despite major advances in the engineering of maintainable and robust software over the years, upgrading software remains a primitive and error-prone activity. In this dissertation, we argue that several problems with upgrading software are caused by a poor integration between upgrade deployment, testing, and problem reporting. To support this argument, we present a characterization of software upgrades resulting from a survey we conducted of 50 system administrators. Motivated by the survey results, we present Mirage, a distributed framework for integrating upgrade deployment, testing, and problem reporting into the overall upgrade development process. Mirage's deployment subsystem allows the vendor to deploy its upgrades in stages over clusters of users sharing similar environments. Staged deployment incorporates testing of the upgrade on the users' machines. It is effective in allowing the vendor to detect problems early and limit the dissemination of buggy upgrades. Oasis, the testing subsystem of Mirage, improves on current state-of-the-art concolic and symbolic engines by implementing a new heuristic to prioritize the exploration of new or affected code in the upgrade. Furthermore, interactive symbolic execution, a new approach exposing the problem of path exploration to the tester using a graphical user interface, can be used to develop new search heuristics or manually guide testing to important areas of the source code. In spite of all of these efforts, some bugs are bound to remain in the software when it is deployed, and will be discovered and reported only later by the users. With the last component of Mirage, we consider the problem of instrumenting programs to reproduce bugs effectively, while keeping user data private. In particular, we develop static and dynamic analysis techniques to minimize the amount of instrumentation, and therefore the overhead incurred by the users, while considerably speeding up debugging. 
By combining up-front testing, staged deployment, testing on user machines, and efficient reporting, Mirage successfully reduces the number of problems, minimizes the number of users affected, and shortens the time needed to fix remaining problems.
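The idea of staged deployment over clusters of similar environments can be sketched as follows. This is a minimal illustration, not Mirage's actual protocol: the function `staged_deploy`, the cluster and machine names, and the policy of halting a cluster after its first failed on-machine test are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def staged_deploy(
    clusters: Dict[str, List[str]],
    test_on_machine: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Deploy an upgrade cluster by cluster, testing on each user machine.

    clusters maps a (hypothetical) environment label to the machines that
    share that environment. test_on_machine returns True when the upgrade
    passes its tests on the given machine.
    Returns (machines upgraded, machines that reported a problem).
    """
    upgraded: List[str] = []
    problems: List[str] = []
    for machines in clusters.values():
        for machine in machines:
            if test_on_machine(machine):
                upgraded.append(machine)
            else:
                problems.append(machine)
                # Assumed policy: machines in the same cluster share a
                # similar environment, so stop here and spare the rest.
                break
    return upgraded, problems


if __name__ == "__main__":
    # Hypothetical clusters and a hypothetical failing machine.
    clusters = {
        "env-1": ["u1", "u2", "u3"],
        "env-2": ["u4", "u5", "u6"],
    }
    failing = {"u4"}
    done, reported = staged_deploy(clusters, lambda m: m not in failing)
    print("upgraded:", done)
    print("problems:", reported)
```

In this run the problem is detected on the first machine of the second cluster, so the two remaining machines in that cluster never receive the buggy upgrade, which mirrors the goal of detecting problems early and limiting their dissemination.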

    Online testing of federated and heterogeneous distributed systems

    DiCE is a system for online testing of federated and heterogeneous distributed systems. We have built a prototype of DiCE and integrated it with an open-source BGP router. DiCE quickly detects three important classes of faults, resulting from configuration mistakes, policy conflicts, and programming errors. The goal of this demo is to showcase our DiCE prototype while it executes an experiment that involves exploring BGP system behavior in a topology with 27 BGP routers and Internet-like conditions (Figure 1).

    Toward Online Testing of Federated and Heterogeneous Distributed Systems

    Making distributed systems reliable is notoriously difficult. It is even more difficult to achieve high reliability for federated and heterogeneous systems, i.e., those that are operated by multiple administrative entities and have numerous interoperable implementations. A prime example of such a system is the Internet's inter-domain routing, today based on BGP. We argue that system reliability should be improved by proactively identifying potential faults using an online testing functionality. We propose DiCE, an approach that continuously and automatically explores the system behavior to check whether the system deviates from its desired behavior. DiCE orchestrates the exploration of relevant system behaviors by subjecting system nodes to many possible inputs that exercise node actions. DiCE starts exploring from the current, live system state and operates in isolation from the deployed system. We describe our experience in integrating DiCE with an open-source BGP router. We evaluate the prototype's ability to quickly detect origin misconfiguration, a recurring operator mistake that causes Internet-wide outages. We also quantify DiCE's overhead and find it to have marginal impact on system performance.
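The exploration loop described in this abstract (start from live state, work on an isolated copy, feed it candidate inputs, check a desired property) can be sketched in a few lines. This is a toy model and not DiCE's implementation: the `Announcement` type, the dictionary used as a routing table, the `expected_origin` map, and the way candidate inputs are supplied as a plain list are all hypothetical, and the AS numbers are taken from the private-use range.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Announcement:
    prefix: str
    as_path: Tuple[int, ...]  # last element is the origin AS


def apply_announcement(rib: Dict[str, Tuple[int, ...]], ann: Announcement) -> None:
    """Toy node action: install the announced path for the prefix."""
    rib[ann.prefix] = ann.as_path


def explore(
    live_rib: Dict[str, Tuple[int, ...]],
    candidate_inputs: List[Announcement],
    expected_origin: Dict[str, int],
) -> List[Announcement]:
    """Return the inputs that would lead to an unexpected origin AS."""
    violations: List[Announcement] = []
    for ann in candidate_inputs:
        # Isolation: every input is applied to a fresh copy of the live state.
        shadow = copy.deepcopy(live_rib)
        apply_announcement(shadow, ann)
        origin = shadow[ann.prefix][-1]
        wanted = expected_origin.get(ann.prefix)
        if wanted is not None and origin != wanted:
            violations.append(ann)
    return violations


if __name__ == "__main__":
    live = {"10.0.0.0/8": (65001, 65010)}
    expected = {"10.0.0.0/8": 65010}
    inputs = [
        Announcement("10.0.0.0/8", (65002, 65010)),  # same origin, new path
        Announcement("10.0.0.0/8", (65003, 65020)),  # different origin
    ]
    for bad in explore(live, inputs, expected):
        print("possible origin misconfiguration:", bad)
```

Because each candidate input is applied to a deep copy, the live table is never modified, which is the property the abstract refers to as operating in isolation from the deployed system.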