8,492 research outputs found

    Monitoring and analysis system for performance troubleshooting in data centers

    Get PDF
    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM, with an mistaken deletion of the state data of Amazon Elastic Load Balancing Service (ELB for short), which was not realized at that time. The mistake first led to a local issue that a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one that EC2 customers were significantly affected. One example was that Netflix, which was using hundreds of Amazon ELB services, was experiencing an extensive streaming service outage when many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought the performance troubleshooting in data centers to world’s attention. As shown in this Amazon ELB case.Troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application to system/platform level metrics, and a variety of representative analytics functions. When supporting algorithms with high computation complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average.Ph.D

    Enhancing Failure Propagation Analysis in Cloud Computing Systems

    Full text link
    In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.Comment: 12 pages, The 30th International Symposium on Software Reliability Engineering (ISSRE 2019

    Program Analysis of Commodity IoT Applications for Security and Privacy: Challenges and Opportunities

    Full text link
    Recent advances in Internet of Things (IoT) have enabled myriad domains such as smart homes, personal monitoring devices, and enhanced manufacturing. IoT is now pervasive---new applications are being used in nearly every conceivable environment, which leads to the adoption of device-based interaction and automation. However, IoT has also raised issues about the security and privacy of these digitally augmented spaces. Program analysis is crucial in identifying those issues, yet the application and scope of program analysis in IoT remains largely unexplored by the technical community. In this paper, we study privacy and security issues in IoT that require program-analysis techniques with an emphasis on identified attacks against these systems and defenses implemented so far. Based on a study of five IoT programming platforms, we identify the key insights that result from research efforts in both the program analysis and security communities and relate the efficacy of program-analysis techniques to security and privacy issues. We conclude by studying recent IoT analysis systems and exploring their implementations. Through these explorations, we highlight key challenges and opportunities in calibrating for the environments in which IoT systems will be used.Comment: syntax and grammar error are fixed, and IoT platforms are updated to match with the submissio

    The Role of a Microservice Architecture on cybersecurity and operational resilience in critical systems

    Get PDF
    Critical systems are characterized by their high degree of intolerance to threats, in other words, their high level of resilience, because depending on the context in which the system is inserted, the slightest failure could imply significant damage, whether in economic terms, or loss of reputation, of information, of infrastructure, of the environment, or human life. The security of such systems is traditionally associated with legacy infrastructures and data centers that are monolithic, which translates into increasingly high evolution and protection challenges. In the current context of rapid transformation where the variety of threats to systems has been consistently increasing, this dissertation aims to carry out a compatibility study of the microservice architecture, which is denoted by its characteristics such as resilience, scalability, modifiability and technological heterogeneity, being flexible in structural adaptations, and in rapidly evolving and highly complex settings, making it suited for agile environments. It also explores what response artificial intelligence, more specifically machine learning, can provide in a context of security and monitorability when combined with a simple banking system that adopts the microservice architecture.Os sistemas críticos são caracterizados pelo seu elevado grau de intolerância às ameaças, por outras palavras, o seu alto nível de resiliência, pois dependendo do contexto onde se insere o sistema, a mínima falha poderá implicar danos significativos, seja em termos económicos, de perda de reputação, de informação, de infraestrutura, de ambiente, ou de vida humana. A segurança informática de tais sistemas está tradicionalmente associada a infraestruturas e data centers legacy, ou seja, de natureza monolítica, o que se traduz em desafios de evolução e proteção cada vez mais elevados. No contexto atual de rápida transformação, onde as variedades de ameaças aos sistemas têm vindo consistentemente a aumentar, esta dissertação visa realizar um estudo de compatibilidade da arquitetura de microserviços, que se denota pelas suas caraterísticas tais como a resiliência, escalabilidade, modificabilidade e heterogeneidade tecnológica, sendo flexível em adaptações estruturais, e em cenários de rápida evolução e elevada complexidade, tornando-a adequada a ambientes ágeis. Explora também a resposta que a inteligência artificial, mais concretamente, machine learning, pode dar num contexto de segurança e monitorabilidade quando combinado com um simples sistema bancário que adota uma arquitetura de microserviços

    AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

    Full text link
    Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively under explored topics, especially those that could significantly benefit from advances in AI literature. We also provide insights into the trends in this field, and what are the key investment opportunities
    • …
    corecore