
    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with the mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which went unnoticed at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming service outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention.
    As the Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address this challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations that data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running these algorithms in VScope, data center operators are notified when performance anomalies occur.
    We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations with which operators can analyze those interactions to find out which components are relevant to a performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1,000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1,000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with more than four times less perturbation than brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
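    The dissertation does not spell out its anomaly detection algorithms in this abstract, but the kind of per-node analysis function a VScope-like operation might deploy can be sketched as a sliding-window z-score detector. Everything below (class name, window size, threshold) is an invented illustration, not VScope's actual method.

```python
from collections import deque
from statistics import mean, stdev

class ZScoreDetector:
    """Hypothetical sketch: flag a metric sample as anomalous when it deviates
    from the recent window by more than `threshold` standard deviations."""

    def __init__(self, window=30, threshold=3.0):
        self.window = deque(maxlen=window)  # recent samples only
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 2:
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.window.append(value)  # anomalies also enter the baseline window
        return anomalous

detector = ZScoreDetector(window=10, threshold=3.0)
baseline = [100 + (i % 3) for i in range(10)]   # steady response times (ms)
flags = [detector.observe(v) for v in baseline]  # no alarms on steady load
spike_flag = detector.observe(500)               # sudden latency spike -> alarm
```

    A detector like this is cheap enough to run inside a monitoring function on every node of an overlay, which matches the abstract's emphasis on low runtime perturbation.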

    A Guide for selecting big data analytics tools in an organisation

    Selecting appropriate big data analytics (BDA) tools (software) for business purposes is increasingly challenging, and a poor selection can lead to incompatibility with existing technologies. This becomes prohibitive when attempting to execute some functions or activities in an environment. The objective of this study was to propose a model that can be used to guide the selection of BDA tools in an organization. The interpretivist approach was employed. Qualitative data was collected and analyzed using the hermeneutics approach. The analysis focused on examining and gaining a better understanding of the strengths and weaknesses of the most common BDA tools. The technical and non-technical factors that influence the selection of BDA tools were identified, and based on these factors a solution is proposed in the form of a model. The model is intended to guide the selection of the most appropriate BDA tools in an organization and to increase their usefulness toward improving an organization's competitiveness.
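    The study derives its selection factors qualitatively, but the kind of guidance such a model could feed into can be sketched as a weighted-criteria score over technical and non-technical factors. The tool names, factor names, weights, and ratings below are all invented for illustration and are not taken from the study.

```python
# Assumed example weights over technical (compatibility, scalability) and
# non-technical (cost, skills availability) factors; ratings are 0..5.
WEIGHTS = {"compatibility": 0.4, "scalability": 0.3, "cost": 0.2, "skills": 0.1}

def score(ratings):
    """Weighted sum of a candidate tool's factor ratings."""
    return sum(WEIGHTS[factor] * rating for factor, rating in ratings.items())

candidates = {
    "tool_a": {"compatibility": 5, "scalability": 3, "cost": 2, "skills": 4},
    "tool_b": {"compatibility": 2, "scalability": 5, "cost": 4, "skills": 3},
}
best = max(candidates, key=lambda tool: score(candidates[tool]))
```

    With compatibility weighted highest, as the abstract's concern about incompatibility with existing technologies suggests, the more compatible candidate wins even though it scales worse.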

    Workload-sensitive Timing Behavior Analysis for Fault Localization in Software Systems

    Software timing behavior measurements, such as response times, often show high statistical variance. This variance can make analysis difficult or even threaten the applicability of statistical techniques. This thesis introduces a method for improving the analysis of software response time measurements that show high variance. Our approach can find relations between timing behavior variance and both trace shape information and workload intensity information. This relation is used to provide timing behavior measurements with substantially reduced variance, which can make timing behavior analysis more robust (e.g., improved confidence and precision) and faster (e.g., fewer simulation runs and shorter monitoring periods). The thesis contributes TracSTA (Trace-Context-Sensitive Timing Behavior Analysis) and WiSTA (Workload-Intensity-Sensitive Timing Behavior Analysis). TracSTA uses trace shape information (i.e., the shape of the control flow corresponding to a software operation execution) and WiSTA uses workload intensity metrics (e.g., the number of concurrent software executions) to create context-specific timing behavior profiles. Both the applicability and the effectiveness are evaluated in several case studies and field studies. The evaluation shows a strong relation between timing behavior and the metrics considered by TracSTA and WiSTA. Additionally, a fault localization approach for enterprise software systems is presented as an application scenario; it uses the timing behavior data provided by TracSTA and WiSTA for anomaly detection.
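    The core WiSTA idea, partitioning response-time measurements by a workload-intensity metric so that each context-specific profile has far less variance than the pooled data, can be sketched as follows. The sample data and the single intensity metric (concurrency level) are invented for illustration; WiSTA's actual metrics and profiles are richer.

```python
from collections import defaultdict
from statistics import pvariance

def context_variances(samples):
    """samples: list of (concurrency_level, response_time_ms) tuples.
    Returns the pooled variance and the per-context variances."""
    by_context = defaultdict(list)
    for concurrency, response_time in samples:
        by_context[concurrency].append(response_time)
    overall = pvariance([rt for _, rt in samples])
    per_context = {c: pvariance(rts) for c, rts in by_context.items()}
    return overall, per_context

# Response times grow with concurrency; within one level they vary little.
samples = [(1, 10), (1, 11), (1, 12),
           (4, 40), (4, 41), (4, 42),
           (8, 90), (8, 91), (8, 92)]
overall, per_context = context_variances(samples)
# Pooled variance is dominated by the workload effect; each context's
# variance is tiny, which is what makes the profiles statistically usable.
```

    This is the sense in which the approach provides measurements with "virtually less" variance: the variance explained by workload intensity is moved out of the data and into the context key.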

    Building and monitoring an event-driven microservices ecosystem

    Throughout the years, software architectures have evolved deeply to address the main issues that have been emerging, mainly due to ever-changing market needs. The need to provide a way for organizations and teams to build applications independently and with greater agility and speed led to the adoption of microservices, particularly endorsing an asynchronous methodology of communication between them via events. Moreover, the ever-growing demand for high-quality, resilient, and highly available systems helped pave the way towards a greater focus on strict quality measures, particularly monitoring and other means of assuring the correct functioning of components in production in real time. Although techniques like logging, monitoring, and alerting are essential for each microservice, they may not be enough in an event-driven architecture. Studies have shown that although organizations have been adopting this type of software architecture, they still struggle with a lack of visibility into end-to-end business processes that span multiple microservices. This thesis explores how to guarantee observability over such an architecture, and thus keep track of the business processes. It does so by providing a tool that facilitates the analysis of the current situation of the ecosystem, as well as allowing users to view and possibly act upon the data. Two solutions have been explored and are presented thoroughly, alongside a detailed comparison, with the purpose of drawing conclusions and providing some guidance to the reader. The outcomes of the thesis resulted in a paper published and registered to be presented at this year's edition of the SEI hosted at ISEP.
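    A common way to recover end-to-end visibility in an event-driven architecture, grouping the event stream by a correlation identifier so each business process can be reconstructed across services, can be sketched as below. The event fields, service names, and event names are assumptions for the example, not the thesis's actual tool.

```python
from collections import defaultdict

def group_by_process(events):
    """events: list of dicts with 'correlation_id', 'service', 'event', 'ts'.
    Returns correlation_id -> time-ordered (service, event) steps."""
    processes = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts"]):
        processes[e["correlation_id"]].append((e["service"], e["event"]))
    return dict(processes)

events = [
    {"correlation_id": "o-1", "service": "orders",   "event": "OrderPlaced",  "ts": 1},
    {"correlation_id": "o-1", "service": "payments", "event": "PaymentTaken", "ts": 2},
    {"correlation_id": "o-2", "service": "orders",   "event": "OrderPlaced",  "ts": 3},
    {"correlation_id": "o-1", "service": "shipping", "event": "OrderShipped", "ts": 4},
]
flows = group_by_process(events)
# flows["o-1"] traces one business process across three services; a process
# stuck mid-flow (like "o-2") becomes visible even though each individual
# microservice's own logs and metrics look healthy.
```

    Per-service logging cannot reveal that "o-2" never progressed past order placement; only this cross-service grouping makes the gap in the business process visible, which is the observability problem the thesis targets.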

    Automation of Cellular Network Faults


    Coordinated Fault-Tolerance for High-Performance Computing Final Project Report
