
    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago: on Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing Service (ELB for short), which was not realized at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. One example was Netflix, which was using hundreds of Amazon ELB services and experienced an extensive streaming service outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention.

    As the Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with over 400% less perturbation than brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
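    The two ideas this abstract describes, on-node anomaly detection over streaming metrics and VFocus-style graph guidance that narrows attention to interacting components, can be pictured with a small sketch. The sketch below is illustrative only: the class names, thresholds, and graph representation are assumptions for exposition, not VScope's actual operations or API.

```python
# Hypothetical sketch of the two ideas in the abstract: a per-node anomaly
# check over a sliding window of metric samples, and a graph walk that keeps
# only components interacting with the anomalous one. Names and thresholds
# are illustrative, not VScope's actual interface.
from collections import deque
from statistics import mean, stdev

class MetricWindow:
    """Sliding window of one metric on one monitored node."""
    def __init__(self, size=60):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def is_anomalous(self, value, k=3.0):
        # Flag values more than k standard deviations from the window mean.
        if len(self.samples) < 10:
            return False
        mu, sigma = mean(self.samples), stdev(self.samples)
        return sigma > 0 and abs(value - mu) > k * sigma

def focus_candidates(interaction_graph, anomalous_node, depth=2):
    """Breadth-first walk over component interactions (VFocus-style guidance):
    return only components within `depth` hops of the anomalous one."""
    frontier, seen = {anomalous_node}, {anomalous_node}
    for _ in range(depth):
        frontier = {nbr for node in frontier
                    for nbr in interaction_graph.get(node, ())} - seen
        seen |= frontier
    return seen

# Example: only vm2 and vm3 interact (directly or transitively) with vm1.
graph = {"vm1": ["vm2"], "vm2": ["vm3"], "vm4": ["vm5"]}
print(focus_candidates(graph, "vm1"))   # {'vm1', 'vm2', 'vm3'}
```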

    Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

    The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. This phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root causes and quantifies their impact upon system operation. Such analysis is critical for gaining in-depth knowledge of straggler occurrence and for focusing development and research efforts on solving the Long Tail challenge. This paper provides an empirical analysis of straggler root causes within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency of stragglers and the impact they impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and that 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution-pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show that the approach is capable of detecting stragglers less than 11% into their execution lifecycle, with 95% accuracy for short-duration jobs.
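    As a rough illustration of the offline-model-plus-online-agent combination described above, the following sketch flags a running task as a likely straggler once its elapsed time exceeds a multiple of the duration predicted from past executions. The 1.5x factor, field names, and polling pattern are assumptions for illustration, not the paper's actual detector.

```python
# Illustrative sketch (not the paper's implementation): an online agent flags
# a task as a likely straggler when its elapsed runtime exceeds a multiple of
# the duration predicted by an offline model of past executions of that job
# type. Threshold and field names are assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class TaskMonitor:
    task_id: str
    predicted_duration_s: float        # from the offline execution-pattern model
    straggler_factor: float = 1.5      # assumed tail threshold
    start_time: float = field(default_factory=time.monotonic)

    def is_straggler(self) -> bool:
        elapsed = time.monotonic() - self.start_time
        return elapsed > self.straggler_factor * self.predicted_duration_s

# Usage: poll periodically from a runtime agent.
monitor = TaskMonitor(task_id="job-42/task-7", predicted_duration_s=120.0)
if monitor.is_straggler():
    print(f"{monitor.task_id}: flagged for speculative re-execution")
```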

    BDWatchdog: real-time monitoring and profiling of Big Data applications and frameworks

    This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2017.12.068

    [Abstract] Current Big Data applications are characterized by a heavy use of system resources (e.g., CPU, disk) generally distributed across a cluster. To effectively improve their performance, there is a critical need for an accurate analysis of both Big Data workloads and frameworks. This means fully understanding how system resources are being used in order to identify potential bottlenecks, from resource bottlenecks to code bottlenecks. This paper presents BDWatchdog, a novel framework that allows real-time and scalable analysis of Big Data applications by combining time series for resource monitoring with flame graphs for code profiling, focusing on the processes that make up the workload rather than the underlying instances on which they are executed. This shift from traditional system-based monitoring to process-based analysis is interesting for new paradigms such as software containers or serverless computing, where the focus is put on applications and not on instances. BDWatchdog has been evaluated on a Big Data cloud-based service deployed at the CESGA supercomputing center. The experimental results show that a process-based analysis allows for a more effective visualization and overall improves the understanding of Big Data workloads. BDWatchdog is publicly available at http://bdwatchdog.dec.udc.es.

    Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P. Ministerio de Educación; FPU15/0338.
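    To make the process-centric idea concrete, here is a minimal sketch of per-process resource sampling, written against the general-purpose psutil library rather than BDWatchdog's own collectors. The filter on "java" processes, the metric names, and the output format are assumptions chosen for illustration, not BDWatchdog's actual data model.

```python
# Minimal sketch of process-based monitoring: sample CPU and I/O per process
# (via psutil) and emit timestamped points keyed by process rather than by
# host, which is the shape of data a time-series backend could ingest.
import time
import psutil

def sample_processes(name_filter="java"):
    points = []
    ts = int(time.time())
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            if name_filter not in (proc.info["name"] or ""):
                continue
            cpu = proc.cpu_percent(interval=None)   # % of CPU since last call
            io = proc.io_counters()                 # cumulative read/write bytes
            points.append({
                "timestamp": ts,
                "process": proc.info["name"],
                "pid": proc.info["pid"],
                "cpu_percent": cpu,
                "read_bytes": io.read_bytes,
                "write_bytes": io.write_bytes,
            })
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return points

if __name__ == "__main__":
    sample_processes()          # first call primes the cpu_percent counters
    time.sleep(1)
    print(sample_processes())   # one second of per-process samples
```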

    Virtual Machine Flow Analysis Using Host Kernel Tracing

    Cloud computing has gained popularity because it offers services at lower cost with a Pay-per-Use model, unlimited storage through distributed storage systems, and flexible computational power through direct hardware access. Virtualization technology allows a physical server to be shared between several isolated virtualized environments by deploying a hypervisor layer on top of the hardware. As a result, each isolated environment can run its own OS and applications without mutual interference. With the growth of cloud usage and the spread of virtualization, performance understanding and debugging are becoming a serious challenge for cloud providers. Offering good QoS and high availability are salient requirements of cloud computing. Nonetheless, the possible reasons behind performance degradation in a VM are numerous: (a) heavy load of an application inside the VM, (b) contention with other applications inside the same VM, (c) contention with other co-located VMs, and (d) cloud platform failures. The first two cases can be managed by the VM owner, while the other cases need to be solved by the infrastructure provider. Such infrastructures are generally very complex and may contain several layers of virtualization, so one key requirement is a precise, low-overhead analysis tool. In this thesis, we present a host-based, precise method to recover the execution flow of virtualized environments, regardless of the level of nested virtualization. To avoid security issues, ease deployment and reduce execution overhead, our method limits its data collection to the hypervisor level. In order to analyse the behavior of each VM, we use a lightweight tracing tool called the Linux Trace Toolkit Next Generation (LTTng) [1]. LTTng is optimised for high-throughput tracing with low overhead, thanks to the lock-free synchronization mechanisms used to update the trace buffer content.
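    A highly simplified sketch of the flow-recovery idea follows: given host-side events that record when a VM's vCPU thread is scheduled in and out of a physical CPU, per-VM execution time can be accumulated without touching the guest. The tuple-based event format below is an assumed stand-in for real hypervisor-level trace records such as those a tracer like LTTng would produce, not LTTng's actual output format.

```python
# Hedged sketch: attribute host CPU time to VMs from scheduling-style events
# collected on the hypervisor side. Event schema is a simplification.
from collections import defaultdict

def vm_cpu_time(events):
    """events: iterable of (timestamp_ns, cpu_id, vm_name, action),
    where action is 'sched_in' or 'sched_out' for that VM's vCPU thread."""
    running = {}                      # cpu_id -> (vm_name, start_ts)
    totals = defaultdict(int)         # vm_name -> accumulated ns
    for ts, cpu, vm, action in sorted(events):
        if action == "sched_in":
            running[cpu] = (vm, ts)
        elif action == "sched_out" and cpu in running:
            name, start = running.pop(cpu)
            if name == vm:
                totals[vm] += ts - start
    return dict(totals)

trace = [
    (100, 0, "vm-a", "sched_in"), (400, 0, "vm-a", "sched_out"),
    (400, 0, "vm-b", "sched_in"), (900, 0, "vm-b", "sched_out"),
]
print(vm_cpu_time(trace))   # {'vm-a': 300, 'vm-b': 500}
```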

    Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World

    This report documents the program and the outcomes of GI-Dagstuhl Seminar 16394 "Software Performance Engineering in the DevOps World". The seminar addressed the problem of performance-aware DevOps. Both DevOps and performance engineering have been growing trends over the past one to two years, in no small part due to the rise in importance of identifying performance anomalies in the operations (Ops) of cloud and big data systems and feeding these back to the development (Dev). However, so far, the research community has treated software engineering, performance engineering, and cloud computing mostly as individual research areas. We aimed to identify opportunities for cross-community collaboration, and to set the path for long-lasting collaborations towards performance-aware DevOps. The main goal of the seminar was to bring together young researchers (PhD students in a later stage of their PhD, as well as PostDocs or Junior Professors) in the areas of (i) software engineering, (ii) performance engineering, and (iii) cloud computing and big data to present their current research projects, to exchange experience and expertise, to discuss research challenges, and to develop ideas for future collaborations.

    A flexible information service for management of virtualized software-defined infrastructures

    There is a major shift in the Internet towards using programmable and virtualized network devices, offering significant flexibility and adaptability. New networking paradigms such as software-defined networking and network function virtualization bring networks and IT domains closer together using appropriate architectural abstractions. In this context, new information management features need to be introduced. The deployed management and control entities in these environments should have a clear, and often global, view of the network environment and should exchange information in alternative ways (e.g. some may have real-time constraints, while others may be throughput-sensitive). Our work addresses these two network management features. In this paper, we define the research challenges in information management for virtualized, highly dynamic environments. Along these lines, we introduce and present the design details of the virtual infrastructure information service, a new management information handling framework that (i) provides logically centralized information flow establishment, optimization, coordination, synchronization and management with respect to the diverse management and control entity demands; (ii) is designed according to the characteristics and requirements of software-defined networking and network function virtualization; and (iii) interoperates with our own virtualized infrastructure framework. Evaluation results demonstrating the flexible and adaptable behaviour of the virtual infrastructure information service and its main operations are included in the paper.
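    The distinction the abstract draws between real-time and throughput-sensitive information exchange can be pictured with a small dispatcher sketch in which consumers register flows under different delivery policies. The class, method names, and batching policy below are assumptions made for illustration; they do not correspond to the virtual infrastructure information service's actual interface.

```python
# Illustrative sketch of per-flow delivery policies: "realtime" subscribers
# receive each item immediately, "bulk" subscribers receive batches.
from collections import defaultdict

class InfoFlowDispatcher:
    def __init__(self, batch_size=100):
        self.subscribers = defaultdict(list)   # topic -> [(callback, mode)]
        self.batches = defaultdict(list)       # (topic, callback) -> pending items
        self.batch_size = batch_size

    def subscribe(self, topic, callback, mode="realtime"):
        # mode: "realtime" (deliver each item at once) or "bulk" (deliver in batches)
        self.subscribers[topic].append((callback, mode))

    def publish(self, topic, item):
        for callback, mode in self.subscribers[topic]:
            if mode == "realtime":
                callback([item])
            else:
                pending = self.batches[(topic, callback)]
                pending.append(item)
                if len(pending) >= self.batch_size:
                    callback(pending)
                    self.batches[(topic, callback)] = []

# Usage: one latency-first and one throughput-first consumer of the same flow.
d = InfoFlowDispatcher(batch_size=2)
d.subscribe("link-utilization", lambda items: print("rt  ", items), mode="realtime")
d.subscribe("link-utilization", lambda items: print("bulk", items), mode="bulk")
for value in (0.2, 0.9, 0.4):
    d.publish("link-utilization", value)
```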
