
    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago: on Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing Service (ELB for short), which was not realized at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. One example was Netflix, which was using hundreds of Amazon ELB services and experienced an extensive streaming service outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention.

    As the Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with over 400% less perturbation than brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
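    The two ideas this abstract describes, on-node anomaly detection over streaming metrics and VFocus-style graph guidance that narrows attention to interacting components, can be pictured with a small sketch. The sketch below is illustrative only: the class names, thresholds, and graph representation are assumptions for exposition, not VScope's actual operations or API.

```python
# Hypothetical sketch of the two ideas in the abstract: a per-node anomaly
# check over a sliding window of metric samples, and a graph walk that keeps
# only components interacting with the anomalous one. Names and thresholds
# are illustrative, not VScope's actual interface.
from collections import deque
from statistics import mean, stdev

class MetricWindow:
    """Sliding window of one metric on one monitored node."""
    def __init__(self, size=60):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def is_anomalous(self, value, k=3.0):
        # Flag values more than k standard deviations from the window mean.
        if len(self.samples) < 10:
            return False
        mu, sigma = mean(self.samples), stdev(self.samples)
        return sigma > 0 and abs(value - mu) > k * sigma

def focus_candidates(interaction_graph, anomalous_node, depth=2):
    """Breadth-first walk over component interactions (VFocus-style guidance):
    return only components within `depth` hops of the anomalous one."""
    frontier, seen = {anomalous_node}, {anomalous_node}
    for _ in range(depth):
        frontier = {nbr for node in frontier
                    for nbr in interaction_graph.get(node, ())} - seen
        seen |= frontier
    return seen

# Example: only vm2 and vm3 interact (directly or transitively) with vm1.
graph = {"vm1": ["vm2"], "vm2": ["vm3"], "vm4": ["vm5"]}
print(focus_candidates(graph, "vm1"))   # {'vm1', 'vm2', 'vm3'}
```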

    Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

    The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. This phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root causes and quantifies their impact upon system operation. Such analysis is critical for gaining in-depth knowledge of straggler occurrence and for focusing development and research efforts on solving the Long Tail challenge. This paper provides an empirical analysis of straggler root causes within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency of stragglers and the impact they impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and that 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution-pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show that the approach is capable of detecting stragglers less than 11% into their execution lifecycle, with 95% accuracy for short-duration jobs.
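    As a rough illustration of the offline-model-plus-online-agent combination described above, the following sketch flags a running task as a likely straggler once its elapsed time exceeds a multiple of the duration predicted from past executions. The 1.5x factor, field names, and polling pattern are assumptions for illustration, not the paper's actual detector.

```python
# Illustrative sketch (not the paper's implementation): an online agent flags
# a task as a likely straggler when its elapsed runtime exceeds a multiple of
# the duration predicted by an offline model of past executions of that job
# type. Threshold and field names are assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class TaskMonitor:
    task_id: str
    predicted_duration_s: float        # from the offline execution-pattern model
    straggler_factor: float = 1.5      # assumed tail threshold
    start_time: float = field(default_factory=time.monotonic)

    def is_straggler(self) -> bool:
        elapsed = time.monotonic() - self.start_time
        return elapsed > self.straggler_factor * self.predicted_duration_s

# Usage: poll periodically from a runtime agent.
monitor = TaskMonitor(task_id="job-42/task-7", predicted_duration_s=120.0)
if monitor.is_straggler():
    print(f"{monitor.task_id}: flagged for speculative re-execution")
```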

    BDWatchdog: real-time monitoring and profiling of Big Data applications and frameworks

    This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2017.12.068

    [Abstract] Current Big Data applications are characterized by a heavy use of system resources (e.g., CPU, disk) generally distributed across a cluster. To effectively improve their performance, there is a critical need for an accurate analysis of both Big Data workloads and frameworks. This means fully understanding how system resources are being used in order to identify potential bottlenecks, from resource bottlenecks to code bottlenecks. This paper presents BDWatchdog, a novel framework that allows real-time and scalable analysis of Big Data applications by combining time series for resource monitoring with flame graphs for code profiling, focusing on the processes that make up the workload rather than the underlying instances on which they are executed. This shift from traditional system-based monitoring to process-based analysis is interesting for new paradigms such as software containers or serverless computing, where the focus is put on applications and not on instances. BDWatchdog has been evaluated on a Big Data cloud-based service deployed at the CESGA supercomputing center. The experimental results show that a process-based analysis allows for a more effective visualization and overall improves the understanding of Big Data workloads. BDWatchdog is publicly available at http://bdwatchdog.dec.udc.es.

    Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P. Ministerio de Educación; FPU15/0338.
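    To make the process-centric idea concrete, here is a minimal sketch of per-process resource sampling, written against the general-purpose psutil library rather than BDWatchdog's own collectors. The filter on "java" processes, the metric names, and the output format are assumptions chosen for illustration, not BDWatchdog's actual data model.

```python
# Minimal sketch of process-based monitoring: sample CPU and I/O per process
# (via psutil) and emit timestamped points keyed by process rather than by
# host, which is the shape of data a time-series backend could ingest.
import time
import psutil

def sample_processes(name_filter="java"):
    points = []
    ts = int(time.time())
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            if name_filter not in (proc.info["name"] or ""):
                continue
            cpu = proc.cpu_percent(interval=None)   # % of CPU since last call
            io = proc.io_counters()                 # cumulative read/write bytes
            points.append({
                "timestamp": ts,
                "process": proc.info["name"],
                "pid": proc.info["pid"],
                "cpu_percent": cpu,
                "read_bytes": io.read_bytes,
                "write_bytes": io.write_bytes,
            })
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return points

if __name__ == "__main__":
    sample_processes()          # first call primes the cpu_percent counters
    time.sleep(1)
    print(sample_processes())   # one second of per-process samples
```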

    Virtual Machine Flow Analysis Using Host Kernel Tracing

    Cloud computing has gained popularity because it offers services at lower cost with a Pay-per-Use model, unlimited storage through distributed storage systems, and flexible computational power through direct hardware access. Virtualization technology allows a physical server to be shared between several isolated virtualized environments by deploying a hypervisor layer on top of the hardware. As a result, each isolated environment can run its own OS and applications without mutual interference. With the growth of cloud usage and the spread of virtualization, performance understanding and debugging are becoming a serious challenge for cloud providers. Offering good QoS and high availability are salient requirements of cloud computing. Nonetheless, the possible reasons behind performance degradation in a VM are numerous: (a) heavy load of an application inside the VM, (b) contention with other applications inside the same VM, (c) contention with other co-located VMs, and (d) cloud platform failures. The first two cases can be managed by the VM owner, while the other cases need to be solved by the infrastructure provider. Such infrastructures are generally very complex and may contain several layers of virtualization, so one key requirement is a precise, low-overhead analysis tool. In this thesis, we present a host-based, precise method to recover the execution flow of virtualized environments, regardless of the level of nested virtualization. To avoid security issues, ease deployment and reduce execution overhead, our method limits its data collection to the hypervisor level. In order to analyse the behavior of each VM, we use a lightweight tracing tool called the Linux Trace Toolkit Next Generation (LTTng) [1]. LTTng is optimised for high-throughput tracing with low overhead, thanks to the lock-free synchronization mechanisms used to update the trace buffer content.
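    A highly simplified sketch of the flow-recovery idea follows: given host-side events that record when a VM's vCPU thread is scheduled in and out of a physical CPU, per-VM execution time can be accumulated without touching the guest. The tuple-based event format below is an assumed stand-in for real hypervisor-level trace records such as those a tracer like LTTng would produce, not LTTng's actual output format.

```python
# Hedged sketch: attribute host CPU time to VMs from scheduling-style events
# collected on the hypervisor side. Event schema is a simplification.
from collections import defaultdict

def vm_cpu_time(events):
    """events: iterable of (timestamp_ns, cpu_id, vm_name, action),
    where action is 'sched_in' or 'sched_out' for that VM's vCPU thread."""
    running = {}                      # cpu_id -> (vm_name, start_ts)
    totals = defaultdict(int)         # vm_name -> accumulated ns
    for ts, cpu, vm, action in sorted(events):
        if action == "sched_in":
            running[cpu] = (vm, ts)
        elif action == "sched_out" and cpu in running:
            name, start = running.pop(cpu)
            if name == vm:
                totals[vm] += ts - start
    return dict(totals)

trace = [
    (100, 0, "vm-a", "sched_in"), (400, 0, "vm-a", "sched_out"),
    (400, 0, "vm-b", "sched_in"), (900, 0, "vm-b", "sched_out"),
]
print(vm_cpu_time(trace))   # {'vm-a': 300, 'vm-b': 500}
```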

    Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World

    This report documents the program and the outcomes of GI-Dagstuhl Seminar 16394 "Software Performance Engineering in the DevOps World". The seminar addressed the problem of performance-aware DevOps. Both DevOps and performance engineering have been growing trends over the past one to two years, in no small part due to the rise in importance of identifying performance anomalies in the operations (Ops) of cloud and big data systems and feeding these back to the development (Dev). However, so far, the research community has treated software engineering, performance engineering, and cloud computing mostly as individual research areas. We aimed to identify opportunities for cross-community collaboration, and to set the path for long-lasting collaborations towards performance-aware DevOps. The main goal of the seminar was to bring together young researchers (PhD students in a later stage of their PhD, as well as PostDocs or Junior Professors) in the areas of (i) software engineering, (ii) performance engineering, and (iii) cloud computing and big data to present their current research projects, to exchange experience and expertise, to discuss research challenges, and to develop ideas for future collaborations.

    A flexible information service for management of virtualized software-defined infrastructures

    There is a major shift in the Internet towards using programmable and virtualized network devices, offering significant flexibility and adaptability. New networking paradigms such as software-defined networking and network function virtualization bring networks and IT domains closer together using appropriate architectural abstractions. In this context, new information management features need to be introduced. The deployed management and control entities in these environments should have a clear, and often global, view of the network environment and should exchange information in alternative ways (e.g. some may have real-time constraints, while others may be throughput-sensitive). Our work addresses these two network management features. In this paper, we define the research challenges in information management for virtualized, highly dynamic environments. Along these lines, we introduce and present the design details of the virtual infrastructure information service, a new management information handling framework that (i) provides logically centralized information flow establishment, optimization, coordination, synchronization and management with respect to the diverse management and control entity demands; (ii) is designed according to the characteristics and requirements of software-defined networking and network function virtualization; and (iii) interoperates with our own virtualized infrastructure framework. Evaluation results demonstrating the flexible and adaptable behaviour of the virtual infrastructure information service and its main operations are included in the paper.
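    The distinction the abstract draws between real-time and throughput-sensitive information exchange can be pictured with a small dispatcher sketch in which consumers register flows under different delivery policies. The class, method names, and batching policy below are assumptions made for illustration; they do not correspond to the virtual infrastructure information service's actual interface.

```python
# Illustrative sketch of per-flow delivery policies: "realtime" subscribers
# receive each item immediately, "bulk" subscribers receive batches.
from collections import defaultdict

class InfoFlowDispatcher:
    def __init__(self, batch_size=100):
        self.subscribers = defaultdict(list)   # topic -> [(callback, mode)]
        self.batches = defaultdict(list)       # (topic, callback) -> pending items
        self.batch_size = batch_size

    def subscribe(self, topic, callback, mode="realtime"):
        # mode: "realtime" (deliver each item at once) or "bulk" (deliver in batches)
        self.subscribers[topic].append((callback, mode))

    def publish(self, topic, item):
        for callback, mode in self.subscribers[topic]:
            if mode == "realtime":
                callback([item])
            else:
                pending = self.batches[(topic, callback)]
                pending.append(item)
                if len(pending) >= self.batch_size:
                    callback(pending)
                    self.batches[(topic, callback)] = []

# Usage: one latency-first and one throughput-first consumer of the same flow.
d = InfoFlowDispatcher(batch_size=2)
d.subscribe("link-utilization", lambda items: print("rt  ", items), mode="realtime")
d.subscribe("link-utilization", lambda items: print("bulk", items), mode="bulk")
for value in (0.2, 0.9, 0.4):
    d.publish("link-utilization", value)
```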
