6 research outputs found

    Agile information technology service management with DevOps: An incident management case study

    This research investigates how DevOps culture can be applied to the incident management process. The authors believe, based on their experience as practitioners, that agile software development methodologies are well suited to incident management, where the goal is to restore interrupted business services quickly. An application management team that solves incidents and applies DevOps practices was studied. Three data collection methods were used: interviews, document analysis, and observation. This research provides novel findings, supported by metrics and real implementation experience, on applying DevOps practices in the incident management process. The novelty of the findings benefits academics and, given the exploratory nature of this research, extends the body of knowledge. It also offers contributions for practitioners by showing how these practices can be applied and what results their implementation produces. Directions for future work are also presented.

    DevOps practices in incident management process

    This research aims to investigate how DevOps culture can be applied to the incident management process and how it can improve it. Given the exploratory approach of the research, a case study was performed. An application management team was studied, in which a sample of 10 people was interviewed. This team solves incidents and provides the necessary support to users in their daily business tasks using DevOps practices. During the case study, three data collection methods were triangulated: semi-structured interviews, document analysis, and observation. This research provides novel findings about a possible relation between DevOps practices and incident management phases, as well as on why and how these practices can help incident management. The results are supported by metrics, such as time between releases, the total number of incident solutions delivered beyond what was planned, and releases per month, showing how the team's performance increased after the implementation of DevOps practices. The novelty of the findings benefits academics and, given the exploratory nature of this research, extends the body of knowledge. It also offers contributions for practitioners by showing how these practices can be applied and what results their implementation produces. Directions for future work are also presented.
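
    As a minimal sketch of how release metrics like those named above (time between releases, incident solutions delivered beyond plan, releases per month) could be computed from a team's release log. The record layout and field names below are assumptions for illustration, not the study's actual data schema.

```python
from datetime import date
from statistics import mean

# Hypothetical release log: (release date, incident fixes planned, incident fixes shipped).
# The layout is an assumption for illustration, not the study's actual schema.
releases = [
    (date(2024, 1, 8), 5, 5),
    (date(2024, 1, 19), 4, 6),
    (date(2024, 2, 2), 6, 7),
    (date(2024, 2, 14), 3, 4),
]

# Time between releases: mean gap in days between consecutive releases.
gaps = [(b[0] - a[0]).days for a, b in zip(releases, releases[1:])]
print("mean days between releases:", mean(gaps))

# Over-delivered incident solutions: fixes shipped beyond what was planned.
over_delivered = sum(max(shipped - planned, 0) for _, planned, shipped in releases)
print("over-delivered incident solutions:", over_delivered)

# Releases per month, keyed by (year, month).
per_month = {}
for d, _, _ in releases:
    per_month[(d.year, d.month)] = per_month.get((d.year, d.month), 0) + 1
print("releases per month:", per_month)
```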

    Model-Driven Machine Learning for Predictive Cloud Auto-scaling

    Cloud provisioning of resources requires continuous monitoring and analysis of the workload on virtual computing resources. However, cloud providers typically offer only rule-based and schedule-based auto-scaling services. Auto-scaling is a cloud mechanism that reacts to real-time metrics and adjusts service instances according to predefined scaling policies. The challenge for this reactive approach is coping with fluctuating load. For data management applications, the workload changes over time, so it needs to be forecast from historical trends and integrated with the auto-scaling service. We aim to discover changes and patterns across multiple resource-usage metrics: CPU, memory, and networking. To address this problem, learning-and-inference-based prediction is adopted to predict resource needs before provisioning actions are taken. First, we develop a novel machine learning-based auto-scaling process that learns from multiple metrics to make cloud auto-scaling decisions. This technique is used for continuous model training and workload forecasting, and the forecasting result triggers the auto-scaling process automatically. We also build the serverless functions of this machine learning-based process, including monitoring, machine learning, model selection, and scheduling, as microservices, and orchestrate these independent services through platform- and language-orthogonal APIs. We demonstrate this architectural implementation on AWS and Microsoft Azure and show the prediction results from machine learning on the fly. Results show significant cost reductions by our proposed solution compared to general threshold-based auto-scaling. Still, the machine learning prediction needs to be integrated with the auto-scaling system, which increases the deployment effort of devising the additional machine learning components. We therefore present a model-driven framework that defines first-class entities to represent machine learning algorithm types, inputs, outputs, parameters, and evaluation scores, and we set up rules for validating these machine learning entities. The connection between the machine learning and auto-scaling systems is represented at two levels of abstraction: a cloud platform independent model and a cloud platform specific model. We automate the model-to-model and model-to-deployment transformations, and we integrate the model-driven approach with DevOps to make models deployable and executable on a target cloud platform. We demonstrate our method with scaling configuration and deployment of two open-source benchmark applications, the Dell DVD Store and Netflix NDBench, on three cloud platforms: AWS, Azure, and Rackspace. The evaluation shows that our model-driven, inference-based auto-scaling reduces deployment effort by approximately 27% compared to ordinary auto-scaling.
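
    As a rough illustration of the learning-and-inference loop described above, the sketch below fits a linear trend to a sliding window of CPU utilization samples, forecasts the next interval, and maps the forecast to an instance count, so scaling is triggered by a prediction rather than a reactive threshold. The window size, sample values, and capacity-per-instance figure are invented for the example and do not reproduce the paper's actual model.

```python
import numpy as np

def forecast_next(samples: np.ndarray) -> float:
    """Fit a linear trend to recent utilization samples and extrapolate one step ahead."""
    t = np.arange(len(samples), dtype=float)
    # Least-squares fit of utilization = slope * t + intercept.
    A = np.vstack([t, np.ones_like(t)]).T
    slope, intercept = np.linalg.lstsq(A, samples, rcond=None)[0]
    return float(slope * len(samples) + intercept)

def instances_needed(predicted_util: float, capacity_per_instance: float = 70.0) -> int:
    """Map predicted aggregate utilization (%) to an instance count (assumed capacity)."""
    return max(1, int(np.ceil(predicted_util / capacity_per_instance)))

# Hypothetical CPU utilization (%) over the last few monitoring intervals.
window = np.array([55.0, 61.0, 66.0, 74.0, 80.0])
pred = forecast_next(window)
print(f"forecast: {pred:.1f}% -> scale to {instances_needed(pred)} instance(s)")
```

    A production version would forecast CPU, memory, and network metrics jointly and feed the decision to the provider's scaling API; this sketch only shows the predict-then-provision ordering that distinguishes the approach from reactive rules.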

    Artificial intelligence driven anomaly detection for big data systems

    The main goal of this thesis is to contribute to research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially Big Data platforms in cloud computing environments. Late detection and manual resolution of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms, to better analyze system performance and effectively utilize computing resources within cloud environments. New, precise, and efficient performance management methods are therefore key to handling performance anomalies and interference impacts and to improving the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads, based on RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning (ML) algorithms on four different monitoring datasets. The results show that our proposed method outperforms the other ML methods, typically achieving 98–99% F-scores. Moreover, we show that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model uses artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters, so that the anomaly detection model can be trained efficiently while achieving high accuracy. The objective is to accelerate the search for the training dataset size, optimize the neural network configuration, and improve the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system demonstrates that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments by up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting the completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution that estimates interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict interference among batch jobs before it occurs within the system. Our interference detection model can estimate, and thus help alleviate, the task slowdown caused by interference. This model assists system operators in making accurate decisions to optimize job placement. Our model is agnostic to the business logic internal to each job; instead, it learns from system performance data, applying artificial neural networks to predict the completion time of batch jobs within cloud environments. We compare our model with three baseline models (a queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4,500 experiments with the DaCapo benchmarking suite confirms the predictive efficiency and capability of the proposed model, which achieves up to 10% MAPE compared with the other models.
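
    A minimal sketch of the kind of neural-network anomaly classifier the first contribution describes: a small multilayer perceptron trained on operating-system monitoring metrics and evaluated with an F-score. The synthetic data, feature set, and hyperparameters are placeholders; the thesis's actual RDD-based features and network design are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic monitoring samples: [cpu%, mem%, disk_io, net_io] per interval.
# Normal behaviour plus an injected "CPU hog" anomaly class; purely illustrative.
normal = rng.normal([40, 50, 10, 5], [8, 10, 3, 2], size=(500, 4))
cpu_hog = rng.normal([90, 55, 12, 5], [5, 10, 3, 2], size=(100, 4))
X = np.vstack([normal, cpu_hog])
y = np.array([0] * 500 + [1] * 100)  # 0 = normal, 1 = anomaly

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Small MLP classifier; layer sizes are arbitrary for the sketch.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
print("F-score:", f1_score(y_te, clf.predict(X_te)))
```

    TRACK-style training would wrap a loop like this in Bayesian Optimization over the training set size and network configuration; that outer loop is omitted here.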

    An Integrated Modeling Framework for Managing the Deployment and Operation of Cloud Applications

    Cloud computing can help Software as a Service (SaaS) providers take advantage of a wide range of cloud benefits, such as agility, continuity, cost reduction, autonomy, and easy management of resources. To reap these benefits, SaaS providers should design their applications to utilize the cloud platform's capabilities. However, this is a daunting task. First, it requires a full understanding of the service offerings from different providers and of the metadata artifacts each provider requires to configure the platform to efficiently deploy, run, and manage the application. Second, it involves complex decisions made by different stakeholders: financial decisions (e.g., selecting a platform that reduces costs), architectural decisions (e.g., partitioning the application to maximize scalability), and operational decisions (e.g., distributing modules to ensure availability, or porting the application to other platforms). Finally, while each stakeholder may make a change to address a specific concern, the impact of that change may span multiple models and influence the decisions of several stakeholders. These factors motivate the need for: (i) a new architectural view model that focuses on service operation and reflects the cloud stakeholders' perspectives, and (ii) a novel framework that provides holistic as well as partial architectural views and generates the required platform artifacts by fragmenting the model into artifacts that can easily be modified separately. This PhD research devises a novel architecture framework, the "5+1 Architectural View Model", for cloud applications, in which each view corresponds to a different perspective on cloud application deployment. The framework is realized as a cloud modeling framework, called "StratusML", which consists of a modeling language that uses layers to specify the cloud configuration space and a transformation engine to generate the configuration-space artifacts. The usefulness and practical applicability of StratusML for modeling multi-cloud and multi-tenant applications have been demonstrated through a representative domain example. Moreover, to automate the framework's evolution as new concerns and cloud platforms emerge, this research also introduces a novel schema matching technique, called "Liberate", which supports the process of domain model creation, evolution, and transformation. Liberate helps address the vendor lock-in problem by reducing the manual effort required to map complex correspondences between cloud schemas whose domain concepts do not share linguistic similarities. The evaluation of Liberate shows its superiority in the cloud domain over existing schema matching approaches.
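
    The two abstraction levels mentioned above (a platform-independent model transformed into platform-specific artifacts) can be pictured with a toy example: one scaling intent rendered into two provider-flavoured configurations. The dictionary shapes and key names below are invented for illustration and do not reflect StratusML's actual metamodel or the providers' real configuration schemas.

```python
# Toy platform-independent model (CPIM): one provider-neutral scaling intent.
cpim = {"service": "web", "min_instances": 2, "max_instances": 10, "scale_on_cpu": 75}

def to_aws(model: dict) -> dict:
    """Render the independent model as an AWS-flavoured config (illustrative keys only)."""
    return {
        "AutoScalingGroup": {"MinSize": model["min_instances"],
                             "MaxSize": model["max_instances"]},
        "ScalingPolicy": {"MetricName": "CPUUtilization",
                          "TargetValue": model["scale_on_cpu"]},
    }

def to_azure(model: dict) -> dict:
    """Render the same model as an Azure-flavoured config (illustrative keys only)."""
    return {
        "autoscaleSetting": {
            "capacity": {"minimum": model["min_instances"],
                         "maximum": model["max_instances"]},
            "rule": {"metricName": "Percentage CPU",
                     "threshold": model["scale_on_cpu"]},
        }
    }

# One independent model, two platform-specific renderings: editing the CPIM once
# updates every target platform, which is the point of the transformation approach.
print(to_aws(cpim))
print(to_azure(cpim))
```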