Monitoring and analysis system for performance troubleshooting in data centers
It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which went unnoticed at the time. The mistake first caused a local issue that affected a small number of ELB service APIs. In about six minutes, it evolved into a critical one that significantly affected EC2 customers. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming outage, and many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention. As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time-consuming.
To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers.
VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel
software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By
running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the
performance issue.
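As a rough illustration of graph-based guidance, the sketch below treats recorded interactions as edges and walks outward from an affected component to list nearby components. It is not VFocus's actual algorithm or API: the function name `related_components`, the `max_hops` parameter, and the VM names are all hypothetical.

```python
from collections import deque


def related_components(interactions, start, max_hops=2):
    """Return {component: hop_count} for components within max_hops of start."""
    # Build an undirected adjacency map from (source, target) interaction pairs.
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Breadth-first search, so each component gets its shortest hop distance.
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    hops.pop(start)  # report only the other components
    return hops


# Hypothetical interaction trace between VMs and a host.
interactions = [
    ("web-vm", "app-vm"),
    ("app-vm", "db-vm"),
    ("db-vm", "storage-host"),
    ("batch-vm", "storage-host"),
]
print(related_components(interactions, "web-vm"))
```

Limiting the search by hop count is one simple way to keep the candidate set small; a real system would likely also weight edges by traffic or timing.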
VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found through application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
Ph.D.
Enhancing Failure Propagation Analysis in Cloud Computing Systems
In order to plan for failure recovery, the designers of cloud systems need to
understand how their system can potentially fail. Unfortunately, analyzing the
failure behavior of such systems can be very difficult and time-consuming, due
to the large volume of events, non-determinism, and reuse of third-party
components. To address these issues, we propose a novel approach that joins
fault injection with anomaly detection to identify the symptoms of failures. We
evaluated the proposed approach in the context of the OpenStack cloud computing
platform. We show that our model can significantly improve the accuracy of
failure analysis in terms of false positives and negatives, with a low
computational cost.
Comment: 12 pages, The 30th International Symposium on Software Reliability
Engineering (ISSRE 2019)
Program Analysis of Commodity IoT Applications for Security and Privacy: Challenges and Opportunities
Recent advances in Internet of Things (IoT) have enabled myriad domains such
as smart homes, personal monitoring devices, and enhanced manufacturing. IoT is
now pervasive---new applications are being used in nearly every conceivable
environment, which leads to the adoption of device-based interaction and
automation. However, IoT has also raised issues about the security and privacy
of these digitally augmented spaces. Program analysis is crucial in identifying
those issues, yet the application and scope of program analysis in IoT remains
largely unexplored by the technical community. In this paper, we study privacy
and security issues in IoT that require program-analysis techniques with an
emphasis on identified attacks against these systems and defenses implemented
so far. Based on a study of five IoT programming platforms, we identify the key
insights that result from research efforts in both the program analysis and
security communities and relate the efficacy of program-analysis techniques to
security and privacy issues. We conclude by studying recent IoT analysis
systems and exploring their implementations. Through these explorations, we
highlight key challenges and opportunities in calibrating for the environments
in which IoT systems will be used.
Comment: syntax and grammar errors are fixed, and IoT platforms are updated to
match with the submissio
The Role of a Microservice Architecture on cybersecurity and operational resilience in critical systems
Critical systems are characterized by their high degree of intolerance to threats, in other words,
their high level of resilience, because depending on the context in which the system is inserted,
the slightest failure could imply significant damage, whether in economic terms, or loss of
reputation, of information, of infrastructure, of the environment, or human life. The security of
such systems is traditionally associated with legacy infrastructures and data centers that are
monolithic, which translates into increasingly high evolution and protection challenges.
In the current context of rapid transformation where the variety of threats to systems has been
consistently increasing, this dissertation aims to carry out a compatibility study of the
microservice architecture, which is denoted by its characteristics such as resilience, scalability,
modifiability and technological heterogeneity, being flexible in structural adaptations, and in
rapidly evolving and highly complex settings, making it suited for agile environments. It also
explores what response artificial intelligence, more specifically machine learning, can provide
in a context of security and monitorability when combined with a simple banking system that
adopts the microservice architecture.
AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
Artificial Intelligence for IT operations (AIOps) aims to combine the power
of AI with the big data generated by IT Operations processes, particularly in
cloud infrastructures, to provide actionable insights with the primary goal of
maximizing availability. There are a wide variety of problems to address, and
multiple use-cases, where AI capabilities can be leveraged to enhance
operational efficiency. Here we provide a review of the AIOps vision, trends,
challenges and opportunities, specifically focusing on the underlying AI
techniques. We discuss in depth the key types of data emitted by IT Operations
activities, the scale and challenges in analyzing them, and where they can be
helpful. We categorize the key AIOps tasks as incident detection, failure
prediction, root cause analysis and automated actions. We discuss the problem
formulation for each task, and then present a taxonomy of techniques to solve
these problems. We also identify relatively underexplored topics, especially
those that could significantly benefit from advances in AI literature. We also
provide insights into the trends in this field and the key investment
opportunities.