128 research outputs found

    Towards fast metamodel evolution in LiquidML

    The software industry is adopting model-driven development approaches for a core set of benefits, such as raising the level of abstraction and reducing coding errors. However, the underlying modeling languages tend to be quite static, making their evolution hard, particularly when the corresponding metamodel does not support primitives and/or functionalities required in specific business domains. This paper presents an extension to the LiquidML language that supports fast metamodel evolution by allowing experts to abstract new language concepts from primitives, while supporting automatic tool evolution and zero application downtime. To substantiate our claims, we evaluate the evolutionary capabilities of existing modeling languages and of LiquidML in a real-world language extension. Funding: Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R (POLOLAS); Ministerio de Economía y Competitividad TIN2015-71938-RED.
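
    The core idea, letting experts abstract new language concepts from existing primitives at runtime so the tooling evolves without downtime, can be illustrated with a minimal sketch. Everything below (Primitive, Concept, ConceptRegistry) is a hypothetical illustration of the general mechanism, not LiquidML's actual API.

```python
# Hypothetical sketch: a registry that lets domain experts define a new
# language concept as a composition of existing primitives at runtime,
# so the metamodel can evolve without redeploying the tooling.
# All names here are illustrative, not LiquidML's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    name: str  # an existing, built-in language primitive

@dataclass
class Concept:
    name: str
    parts: list  # primitives (or earlier concepts) this concept abstracts over

class ConceptRegistry:
    """Holds the evolving metamodel; new concepts can be added while running."""
    def __init__(self, primitives):
        self._primitives = {p.name: p for p in primitives}
        self._concepts = {}

    def define_concept(self, name, part_names):
        # Resolve each part against known primitives or previously defined concepts.
        parts = [self._primitives.get(n) or self._concepts[n] for n in part_names]
        self._concepts[name] = Concept(name, parts)
        return self._concepts[name]

registry = ConceptRegistry([Primitive("filter"), Primitive("transform")])
registry.define_concept("CleanStep", ["filter", "transform"])  # new concept, no downtime
```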

    Governance of Cloud-hosted Web Applications

    Cloud computing has revolutionized the way developers implement and deploy applications. By running applications on large-scale compute infrastructures and programming platforms that are remotely accessible as utility services, cloud computing provides scalability, high availability, and increased user productivity. Despite the advantages inherent to the cloud computing model, it has also given rise to several software management and maintenance issues. Specifically, cloud platforms do not enforce developer best practices and other administrative requirements when deploying applications. Cloud platforms also do not facilitate establishing service level objectives (SLOs) on application performance, which are necessary to ensure the reliable and consistent operation of applications. Moreover, cloud platforms do not provide adequate support to monitor the performance of deployed applications and to conduct root cause analysis when an application exhibits a performance anomaly. We employ governance as a methodology to address the above-mentioned issues prevalent in cloud platforms. We devise novel governance solutions that achieve administrative conformance, developer best practices, and performance SLOs in the cloud via policy enforcement, SLO prediction, performance anomaly detection and root cause analysis. The proposed solutions are fully automated and built into the cloud platforms as cloud-native features, thereby sparing application developers from having to implement similar features themselves. We evaluate our methodology using real-world cloud platforms, and show that our solutions are highly effective and efficient.
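
    As a rough illustration of the policy-enforcement idea, the sketch below rejects a deployment whose metadata violates a governance policy. The metadata fields and the policy structure are assumptions for illustration, not the dissertation's implementation or any real platform's API.

```python
# Illustrative sketch of deploy-time policy enforcement: allow a deployment
# only if every governance policy passes. Field names and policy structure
# are hypothetical.

class PolicyViolation(Exception):
    """Raised when a deployment fails a governance check."""

def enforce_policies(app_metadata, policies):
    """Check the application's metadata against every governance policy."""
    for policy in policies:
        if not policy["check"](app_metadata):
            raise PolicyViolation(policy["description"])

policies = [
    {"description": "APIs must declare a version",
     "check": lambda app: bool(app.get("api_version"))},
    {"description": "A per-request latency SLO must be set",
     "check": lambda app: app.get("latency_slo_ms", 0) > 0},
]

# This deployment declares both a version and a latency SLO, so it passes.
enforce_policies({"api_version": "v2", "latency_slo_ms": 250}, policies)
```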

    Efficiently Conducting Quality-of-Service Analyses by Templating Architectural Knowledge

    Previously, software architects were unable to effectively and efficiently apply reusable knowledge (e.g., architectural styles and patterns) in architectural analyses. This work tackles this problem with a novel method for creating and applying templates for reusable knowledge. These templates capture reusable knowledge formally and can be integrated efficiently into architectural analyses.

    CloudOps: Towards the Operationalization of the Cloud Continuum: Concepts, Challenges and a Reference Framework

    The current trend of developing highly distributed, context-aware, heterogeneous, compute-intensive and data-sensitive applications is changing the boundaries of cloud computing. Encouraged by the growing IoT paradigm and the availability of flexible edge devices, an ecosystem of resources, ranging from high-density compute and storage to very lightweight embedded computers running on batteries or solar power, is available to DevOps teams in what is known as the Cloud Continuum. In this dynamic context, manageability is key, as are controlled operations and resource monitoring for handling anomalies. Unfortunately, the operation and management of such heterogeneous computing environments (including edge, cloud and network services) is complex, and operators face challenges such as the continuous optimization and autonomous (re-)deployment of context-aware stateless and stateful applications, where they must nevertheless ensure service continuity while anticipating potential failures in the underlying infrastructure. In this paper, we propose a novel CloudOps workflow (extending the traditional DevOps pipeline), along with techniques and methods that allow application operators to fully embrace the possibilities of the Cloud Continuum. Our approach supports DevOps teams in the operationalization of the Cloud Continuum. We also provide an extensive explanation of the scope, possibilities and future of CloudOps. This research was funded by the European project PIACERE (Horizon 2020 Research and Innovation Programme, under grant agreement No. 101000162).

    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago: on Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of Amazon's Elastic Load Balancing service (ELB for short), which went unnoticed at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming service outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world's attention. As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address this challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope; by running these algorithms, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze these interactions to find out which components are relevant to a performance issue. VScope's capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope's ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a 'thin layer' that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with over 400% less perturbation than brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces; the experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
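
    As one deliberately simple example of the kind of anomaly detection such a system might run over collected metrics, the sketch below flags a node whose latest sample deviates strongly from its own recent history. This z-score detector is illustrative only; it is not the dissertation's algorithm.

```python
# Minimal sketch of per-node anomaly detection over collected metrics:
# flag a node when its latest sample lies far outside its recent history.
# Illustrative only, not VScope's actual algorithms.

import statistics

def detect_anomalies(samples_by_node, threshold=3.0):
    """samples_by_node: {node_id: [metric samples, oldest..newest]}"""
    anomalous = []
    for node, samples in samples_by_node.items():
        history, latest = samples[:-1], samples[-1]
        if len(history) < 2:
            continue  # not enough history to judge this node
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(latest - mean) / stdev > threshold:
            anomalous.append(node)
    return anomalous

print(detect_anomalies({"vm-17": [10, 11, 9, 10, 42],
                        "vm-3": [10, 10, 11, 10, 10]}))
# -> ['vm-17']
```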

    Analytically-Driven Resource Management for Cloud-Native Microservices

    Resource management for cloud-native microservices has attracted a lot of recent attention. Previous work has shown that machine learning (ML)-driven approaches outperform traditional techniques, such as autoscaling, in terms of both SLA maintenance and resource efficiency. However, ML-driven approaches also face challenges, including lengthy data collection processes and limited scalability. We present Ursa, a lightweight resource management system for cloud-native microservices that addresses these challenges. Ursa uses an analytical model that decomposes the end-to-end SLA into per-service SLAs, and maps each per-service SLA to individual resource allocations per microservice tier. To speed up the exploration process and avoid prolonged SLA violations, Ursa explores each microservice individually, and swiftly stops exploration if latency exceeds its SLA. We evaluate Ursa on a set of representative end-to-end microservice topologies, including a social network, a media service and a video processing pipeline, each consisting of multiple classes and priorities of requests with different SLAs, and compare it against two representative ML-driven systems, Sinan and Firm. Compared to these ML-driven approaches, Ursa provides significant advantages: it shortens the data collection process by more than 128x, and its control plane is 43x faster. At the same time, Ursa does not sacrifice resource efficiency or SLAs. During online deployment, Ursa reduces the SLA violation rate by 9.0%–49.9%, and reduces CPU allocation by up to 86.2% compared to the ML-driven approaches.
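
    The budgeting idea, decomposing an end-to-end SLA into per-service SLAs and stopping exploration as soon as a service overshoots its budget, can be sketched as follows. The proportional weights and the exploration loop below are assumptions for illustration, not Ursa's actual analytical model.

```python
# Hedged sketch of SLA decomposition and early-stopping exploration:
# split the end-to-end latency budget across services, then explore each
# service alone and stop once its measured latency overshoots its budget.

def split_sla(end_to_end_sla_ms, weights):
    """Divide the end-to-end latency budget across services, proportionally to weights."""
    total = sum(weights.values())
    return {svc: end_to_end_sla_ms * w / total for svc, w in weights.items()}

def explore_service(svc, sla_ms, measure_latency_ms, cpu_candidates):
    """Shrink the CPU allocation; keep the smallest one that still meets the SLA."""
    best = None
    for cpus in sorted(cpu_candidates, reverse=True):
        if measure_latency_ms(svc, cpus) > sla_ms:
            break  # stop swiftly: shrinking further would only prolong SLA violations
        best = cpus
    return best

budgets = split_sla(200, {"frontend": 1, "backend": 2, "db": 1})
print(budgets)  # {'frontend': 50.0, 'backend': 100.0, 'db': 50.0}
```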

    Adaptive Monitoring of Complex Software Systems using Management Metrics

    Software systems supporting networked, transaction-oriented services are large and complex; they comprise a multitude of inter-dependent layers and components, and they implement many dynamic optimization mechanisms. In addition, these systems are subject to workloads that are hard to predict. These factors make monitoring these systems, as well as performing problem determination, challenging and costly. In this thesis we tackle these challenges with the goal of lowering the cost and improving the effectiveness of monitoring and problem determination by reducing the dependence on human operators. Specifically, this thesis presents and demonstrates the effectiveness of an efficient, automated monitoring approach which enables detection of errors and failures, and which assists in localizing faults. Software systems expose various types of monitoring data; this thesis focuses on the use of management metrics to monitor a system's health. We devise a system modeling approach which entails modeling stable statistical correlations among management metrics; these correlations characterize a system's normal behaviour. This approach allows a system model to be built automatically and efficiently using the monitoring data alone. In order to control the monitoring overhead, and yet allow a system's health to be assessed reliably, we design an adaptive monitoring approach. This adaptive capability builds on the flexible nature of our system modeling approach, which allows the set of monitored metrics to be altered at runtime. We develop methods to automatically select the management metrics to collect at the minimal monitoring level, without any domain knowledge. In addition, we devise an automated fault localization approach, which leverages the ability of the monitoring system to analyze individual metrics. Using a realistic multi-tier software system, including different applications based on Java Enterprise Edition and industrial-strength products, we evaluate our system modeling approach. We show that stable metric correlations exist in complex software systems and that many of these correlations can be modeled using simple, efficient techniques. We investigate the effect of the collection of management metrics on system performance and show that the monitoring overhead can be high and thus needs to be controlled. We employ fault injection experiments to evaluate the effectiveness of our adaptive monitoring and fault localization approach, and demonstrate that it is cost-effective, has high fault coverage and, in the majority of the cases studied, provides pertinent diagnosis information. The main contribution of this work is to show how to monitor complex software systems and determine problems in them automatically and efficiently. Our solution approach has wide applicability, and the techniques we use are simple yet effective. Our work suggests that the cost of monitoring software systems is not necessarily a function of their complexity, providing hope that the health of increasingly large and complex systems can be tracked with a limited amount of human resources and without sacrificing much system performance.
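
    The correlation-modeling idea can be made concrete with a small sketch: fit a simple linear relation between a pair of management metrics during healthy operation, then flag a potential error when the runtime residual grows large, i.e., when the correlation breaks. This is an illustrative sketch under those assumptions, not the thesis's actual techniques.

```python
# Sketch of correlation-based health assessment: learn y ~ a*x + b between a
# pair of management metrics from healthy samples; a large residual at runtime
# signals a broken correlation, i.e., a potential error. Illustrative only.

def fit_linear(xs, ys):
    """Least-squares fit of y ~ a*x + b over healthy training samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def residual(model, x, y):
    """Distance between the observed y and the value the model predicts."""
    a, b = model
    return abs(y - (a * x + b))

# Train on healthy data: e.g., request rate vs. thread-pool usage move together.
model = fit_linear([100, 200, 300, 400], [11, 21, 29, 41])
print(residual(model, 250, 26))  # small residual: behaviour looks normal
print(residual(model, 250, 90))  # large residual: correlation broken, flag it
```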