32 research outputs found

    Improved self-management of datacenter systems applying machine learning

    Autonomic Computing is a Computer Science research area that originated in the mid-2000s. It focuses on the optimization and improvement of complex distributed computing systems through self-control and self-management. As distributed computing systems grow in complexity, as with multi-datacenter systems in cloud computing, system operators and architects need more help to understand, design, and optimize these systems manually, all the more when the systems are distributed across the world and belong to different entities and authorities. Self-management lets these distributed computing systems improve their resource and energy management, an important issue given the costs of obtaining, running, and maintaining resources. Here we propose to improve Autonomic Computing techniques for resource management by applying modeling and prediction methods from Machine Learning and Artificial Intelligence. Machine Learning methods can find accurate models of system behavior, often with intelligible explanations, and can predict and infer system states and values. Models obtained by automatic learning have the advantage of being easily updated after workload or configuration changes, by re-collecting examples and re-training the predictors. By employing automatic modeling and predictive abilities, we can find new methods for making "intelligent" decisions and for discovering new information and knowledge about systems. This thesis departs from the state of the art, where management is based on administrator expertise, well-known data, ad hoc algorithms and models, and elements studied from the point of view of a single computing machine, toward a novel state of the art where management is driven by models learned from the system itself, providing useful feedback and making up for incomplete, missing, or uncertain data, from the point of view of a global network of datacenters.
    - First, we cover the scenario where the decision maker knows every piece of information about the system: how much each job will consume, what the desired quality of service is and will be, what the deadlines for the workload are, and so on, focusing on each component and policy of each element involved in executing these jobs.
    - Then we focus on the scenario where, instead of fixed oracles that provide information from an expert formula or set of conditions, machine learning is used to create these oracles. Here we look at components and specific details while some of the information is unknown and must be learned and predicted.
    - We reduce the problem of optimizing resource allocations and requirements for virtualized web services to a mathematical problem, indicating each factor, variable, and element involved, as well as all the constraints the scheduling process must attend to. The scheduling problem can be modeled as a Mixed Integer Linear Program (a minimal sketch follows this abstract). Here we face the scenario of a full datacenter, and we further introduce prediction of some of the information.
    - We complement the model by expanding the predicted elements, studying the main resources (CPU, memory, and IO), which can suffer from noise, inaccuracy, or unavailability. Once learned predictors for certain components let decision making improve, the system can become more "expert-knowledge independent" and research can focus on a scenario where all the elements provide noisy, uncertain, or private information. We also introduce new factors into the management optimization, since context and costs may change for each datacenter, turning the model "multi-datacenter".
    - Finally, we review the cost of placing datacenters according to green energy sources, and distribute the load according to green energy availability.
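    To make the MILP formulation above concrete, here is a minimal sketch of job-to-host placement that minimizes the number of powered-on hosts. The job demands, host capacities, and the use of the open-source PuLP solver library are assumptions introduced for illustration, not the thesis's actual model.

        # Minimal MILP placement sketch (hypothetical data; requires PuLP).
        from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

        jobs = {"j1": 2.0, "j2": 1.5, "j3": 3.0}   # hypothetical CPU demand per job
        hosts = {"h1": 4.0, "h2": 4.0}             # hypothetical CPU capacity per host

        prob = LpProblem("job_placement", LpMinimize)
        x = LpVariable.dicts("x", [(j, h) for j in jobs for h in hosts], cat=LpBinary)
        on = LpVariable.dicts("on", hosts, cat=LpBinary)   # is the host powered on?

        prob += lpSum(on[h] for h in hosts)        # objective: minimize powered-on hosts
        for j in jobs:                             # each job is placed exactly once
            prob += lpSum(x[(j, h)] for h in hosts) == 1
        for h in hosts:                            # placements must respect capacity
            prob += lpSum(jobs[j] * x[(j, h)] for j in jobs) <= hosts[h] * on[h]

        prob.solve()
        placement = {j: h for j in jobs for h in hosts if x[(j, h)].varValue > 0.5}
        print(placement)

    A real datacenter model would add memory and IO dimensions, QoS constraints, and predicted rather than known demands, as the abstract describes.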

    Modeling and analysis of high availability techniques in a virtualized system

    Availability evaluation of a virtualized system is critical to the wide deployment of cloud computing services. Time-based and prediction-based rejuvenation of virtual machines (VMs) and virtual machine monitors, VM failover, and live VM migration are common high-availability (HA) techniques in a virtualized system. This paper investigates the effect of combining these availability techniques on VM availability in a virtualized system where various software and hardware failures may occur. For each combination, we construct analytic models. The results show that: (1) rejuvenation mechanisms improve VM availability; (2) prediction-based rejuvenation enhances VM availability much more than time-based VM rejuvenation when the prediction success probability is above 70%, regardless of whether failover and/or live VM migration is also deployed; (3) the failover mechanism outperforms live VM migration, although they can work together for higher VM availability, and both can be combined with software rejuvenation mechanisms for even higher availability; and (4) the time interval setting is critical to a time-based rejuvenation mechanism. These analytic results provide guidelines for deploying HA techniques in a virtualized system and for setting their parameters.
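    As a minimal illustration of what such an analytic model can look like, the sketch below solves a four-state continuous-time Markov chain for the steady-state availability of a VM under time-based rejuvenation. The states and all transition rates are hypothetical stand-ins, not the paper's actual model or parameters.

        import numpy as np

        # States: 0 = robust (up), 1 = failure-probable (up, aged),
        #         2 = rejuvenating (down), 3 = failed (down)
        lam_age, lam_fail = 1 / 240.0, 1 / 72.0            # aging/failure rates (per hour)
        r_trigger, mu_rejuv, mu_repair = 1 / 168.0, 12.0, 0.5  # rejuvenation/repair rates

        # Infinitesimal generator matrix; each row sums to zero.
        Q = np.array([
            [-lam_age,                   lam_age,        0.0,        0.0],
            [0.0,      -(lam_fail + r_trigger),    r_trigger,   lam_fail],
            [mu_rejuv,                       0.0,  -mu_rejuv,        0.0],
            [mu_repair,                      0.0,        0.0, -mu_repair],
        ])

        # Solve pi @ Q = 0 with the normalization sum(pi) = 1.
        A = np.vstack([Q.T, np.ones(4)])
        b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)

        print(f"steady-state availability: {pi[0] + pi[1]:.6f}")  # up in states 0 and 1

    Raising the rejuvenation trigger rate trades planned downtime in state 2 against unplanned downtime in state 3, which is exactly the time-interval sensitivity that finding (4) points to.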

    MACHS: Mitigating the Achilles Heel of the Cloud through High Availability and Performance-aware Solutions

    Cloud computing is continuously growing as a business model for hosting information and communication technology applications. However, many concerns arise regarding the quality of service (QoS) offered by the cloud. One major challenge is the high availability (HA) of cloud-based applications. The key to achieving availability requirements is to develop an approach that is immune to cloud failures while minimizing service level agreement (SLA) violations. To this end, this thesis addresses the HA of cloud-based applications from different perspectives. First, the thesis proposes a component HA-aware scheduler (CHASE) to manage the deployments of carrier-grade cloud applications while maximizing their HA and satisfying the QoS requirements. Second, a Stochastic Petri Net (SPN) model is proposed to capture the stochastic characteristics of cloud services and quantify the expected availability offered by an application deployment. The SPN model is then associated with an extensible policy-driven cloud scoring system that integrates other cloud challenges (i.e., green and cost concerns) with HA objectives; a toy sketch of such scoring follows this abstract. The proposed HA-aware solutions are extended with a live virtual machine migration model that provides a trade-off between migration time and downtime while maintaining the HA objectives. Furthermore, the thesis proposes a generic input template for cloud simulators, GITS, to facilitate the creation of cloud scenarios while ensuring reusability, simplicity, and portability. Finally, an availability-aware CloudSim extension, ACE, is proposed. ACE extends the CloudSim simulator with failure injection, computational paths, repair, failover, load balancing, and other availability-based modules.
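    The sketch below illustrates the general shape of a policy-driven scoring system that blends HA with green and cost concerns by ranking candidate deployments with a weighted score. The Deployment fields, policy weights, and normalization are hypothetical assumptions; the thesis's actual scoring system is driven by the SPN availability model.

        from dataclasses import dataclass

        @dataclass
        class Deployment:
            name: str
            availability: float    # expected availability (e.g., from an SPN model)
            hourly_cost: float     # hypothetical cost in USD per hour
            green_fraction: float  # fraction of energy from renewables, 0..1

        # Hypothetical policy weights; a policy-driven system would load these per tenant.
        POLICY = {"ha": 0.6, "cost": 0.25, "green": 0.15}

        def score(d: Deployment, max_cost: float) -> float:
            """Blend the normalized objectives into a single 0..1 score."""
            cost_score = 1.0 - min(d.hourly_cost / max_cost, 1.0)
            return (POLICY["ha"] * d.availability
                    + POLICY["cost"] * cost_score
                    + POLICY["green"] * d.green_fraction)

        candidates = [
            Deployment("dc-east", availability=0.9995, hourly_cost=1.8, green_fraction=0.4),
            Deployment("dc-west", availability=0.9990, hourly_cost=1.2, green_fraction=0.9),
        ]
        max_cost = max(d.hourly_cost for d in candidates)
        print(max(candidates, key=lambda d: score(d, max_cost)).name)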

    Software aging and rejuvenation: 20 years (1995-2014) - overview and challenges

    Although software aging and rejuvenation (SAR) is a young research field, a great deal of knowledge has been produced in its first 20 years. Nowadays, important scientific journals and conferences include SAR-related topics in their scope of interest. This fast growth and wide range of dissemination venues pose a challenge for researchers trying to keep track of the new findings and trends in this area. In this work, we collected and analyzed SAR research data to detect trends, patterns, and thematic gaps, in order to provide a comprehensive view of this research field over its first 20 years. We adopted the systematic mapping approach to answer research questions such as: How have the main topics investigated in SAR evolved over time? Which aging effects are the most investigated? Which rejuvenation techniques and strategies are used most frequently? Master's dissertation, supported by CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior

    Systematic analysis of software development in cloud computing perceptions

    Cloud computing is characterized as a shared computing and communication infrastructure. It encourages the efficient and effective development processes carried out in various organizations. Cloud computing offers both opportunities and solutions for outsourcing and managing software development operations across distinct geographies, and it is adopted by organizations and application developers for developing quality software. The cloud has a significant impact on managing the artificial complexity involved in designing and developing quality software. Software development organizations prefer cloud computing for outsourcing tasks because of its availability and scalability. Cloud computing is an ideal choice for developing modern software, as it provides a completely new way of building real-time, cost-effective, efficient, quality software. Tenants (providers, developers, and consumers) are provided with platforms, software services, and infrastructure on a pay-per-use basis. Cloud-based software services are becoming increasingly popular, as evidenced by their widespread use. The cloud computing approach has drawn the interest of researchers and businesses because of its ability to provide a flexible and resourceful platform for development and deployment. To build a cohesive understanding of the analyzed problems and the solutions for improving software quality, the existing literature on cloud-based software development should be analyzed and synthesized systematically. Keyword strings were formulated for finding relevant research articles from journals, book chapters, and conference papers. Research articles published between 2011 and 2021 in various scientific databases were extracted and analyzed to retrieve the relevant ones. A total of 97 research publications are examined in this systematic literature review (SLR) and evaluated as appropriate studies for explaining and discussing the proposed topic. The major emphasis of the presented SLR is to identify the participating entities of cloud-based software development, the challenges associated with adopting the cloud for software development processes, and its significance to software industries and developers. This SLR will assist organizations, designers, and developers in developing and deploying user-friendly, efficient, effective, and real-time software applications. Qatar University Internal Grant - No. IRCC-2021-010

    Workload Prediction for Efficient Performance Isolation and System Reliability

    In large-scale distributed systems, like multi-tier storage systems and cloud data centers, resource sharing among workloads brings multiple benefits while introducing many performance challenges. The key to effective workload multiplexing is accurate workload prediction. This thesis focuses on how to capture the salient characteristics of real-world workloads in order to develop workload prediction methods and drive scheduling and resource allocation policies, achieving efficient and timely resource isolation among applications. In a multi-tier storage system, high-priority user work is often multiplexed with low-priority background work. This raises the challenge of striking a balance between maintaining user performance and maximizing the amount of finished background work. In this thesis, we propose two resource isolation policies based on different workload prediction methods: one Markovian model-based and the other neural network-based. Via workload prediction, these policies aim at discovering opportune times to schedule background work with minimum impact on user performance. Trace-driven simulations verify the efficiency of the two proposed resource isolation policies. The Markovian model-based policy successfully schedules the background work in the appropriate periods with small impact on user performance. The neural network-based policy adaptively schedules user and background work, consistently meeting both performance requirements. This thesis also proposes an accurate and efficient neural network-based prediction method for data center usage series, called PRACTISE. Different from traditional neural networks for time series prediction, PRACTISE selects the most informative features from past observations of the time series itself (a minimal stand-in sketch follows this abstract). Testing on a large set of usage series from production data centers illustrates the accuracy (e.g., prediction error) and efficiency (e.g., time cost) of PRACTISE. The quality of the usage prediction also enables proactive resource management in highly virtualized cloud data centers. In this thesis, we analyze performance tickets in cloud data centers, and propose an active sizing algorithm, named ATM, that predicts usage workloads and re-allocates capacity to workloads to avoid VM performance tickets. Moreover, driven by cheap prediction of usage tails, we also present TailGuard, which dynamically clones VMs among co-located boxes in order to efficiently reduce the performance violations of physical boxes in cloud data centers.
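    The sketch below is a minimal stand-in for the idea behind PRACTISE, not the method itself: build lagged features from a usage series, keep only the lags most correlated with the target, and train a small neural network on them. The synthetic series, lag window, and model size are all assumptions for illustration (uses numpy and scikit-learn).

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        # Synthetic CPU-usage series with daily seasonality (stand-in for a real trace).
        t = np.arange(2000)
        series = 50 + 20 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 3, t.size)

        MAX_LAG, TOP_K = 48, 8
        # Lagged feature matrix: predict series[i] from the previous MAX_LAG points.
        X = np.stack([series[i - MAX_LAG:i] for i in range(MAX_LAG, series.size)])
        y = series[MAX_LAG:]

        # Feature selection: keep the lags most correlated with the target.
        corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(MAX_LAG)])
        best_lags = np.argsort(corr)[-TOP_K:]

        split = int(0.8 * len(y))
        model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
        model.fit(X[:split][:, best_lags], y[:split])
        pred = model.predict(X[split:][:, best_lags])
        print("mean absolute error:", np.mean(np.abs(pred - y[split:])))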

    Data Replication and Its Alignment with Fault Management in the Cloud Environment

    Nowadays, exponential data growth has become one of the major challenges all over the world. It may cause a series of negative impacts such as network overloading, high system complexity, and inadequate data security. Cloud computing was developed to construct a novel paradigm that alleviates massive data processing challenges with its on-demand services and distributed architecture. Data replication has been proposed to strategically distribute the data access load by creating copies of the data at multiple cloud data centres. A replica-applied cloud environment not only achieves decreased response times, increased data availability, and a more balanced resource load, but also protects the cloud environment against upcoming faults. Reactive fault tolerance strategies are also required to handle faults once they have occurred. As a result, data replication strategies should be aligned with reactive fault tolerance strategies to achieve a complete management chain in the cloud environment. In this thesis, a data replication and fault management framework is proposed to establish decentralised, overarching management of the cloud environment. Three data replication strategies are first proposed based on this framework. A replica creation strategy is proposed to reduce the total cost by jointly considering data dependency and access frequency in the replica creation decision-making process. In addition, a cloud-map-oriented and cost-efficiency-driven replica creation strategy is proposed to achieve the optimal cost reduction per replica in the cloud environment. The local and remote data relationships are further analysed by defining two novel data dependency types, Within-DataCentre Data Dependency and Between-DataCentre Data Dependency, according to data location. Furthermore, a network performance based replica selection strategy is proposed to avoid potential network overloading problems while increasing the number of concurrently running instances (a minimal sketch of such selection follows this abstract).
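    As a minimal sketch of network-performance-based replica selection, the function below picks the replica site with the lowest estimated fetch time while skipping nearly saturated links. The Replica fields, the utilization threshold, and the fetch-time estimate are illustrative assumptions, not the thesis's actual strategy.

        from dataclasses import dataclass

        @dataclass
        class Replica:
            site: str
            rtt_ms: float           # measured round-trip time to the requester
            bandwidth_mbps: float   # available bandwidth on the path
            utilization: float      # current link utilization, 0..1

        def best_replica(replicas, object_mb, max_util=0.8):
            """Pick the replica with the lowest estimated fetch time,
            skipping links that are close to saturation."""
            def fetch_time(r):
                transfer_s = (object_mb * 8) / r.bandwidth_mbps
                return r.rtt_ms / 1000.0 + transfer_s
            usable = [r for r in replicas if r.utilization < max_util]
            return min(usable or replicas, key=fetch_time)

        replicas = [
            Replica("dc-1", rtt_ms=12, bandwidth_mbps=400, utilization=0.85),
            Replica("dc-2", rtt_ms=45, bandwidth_mbps=900, utilization=0.30),
        ]
        print(best_replica(replicas, object_mb=256).site)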