515 research outputs found

    Resource management of replicated service systems provisioned in the cloud

    Get PDF
    Service providers seek scalable and cost-effective cloud solutions for hosting their applications. Despite significant recent advances facilitating the deployment and management of services on cloud platforms, a number of challenges still remain. Service providers are confronted with time-varying requests for the provided applications, inter- dependencies between different components, performance variability of the procured virtual resources, and cost structures that differ from conventional data centers. Moreover, fulfilling service level agreements, such as the throughput and response time percentiles, becomes of paramount importance for ensuring business advantages.In this thesis, we explore service provisioning in clouds from multiple points of view. The aim is to best provide service replicas in the form of VMs to various service applications, such that their tail throughput and tail response times, as well as resource utilization, meet the service level agreements in the most cost effective manner. In particular, we develop models, algorithms and replication strategies that consider multi-tier composed services provisioned in clouds. We also investigate how a service provider can opportunistically take advantage of observed performance variability in the cloud. Finally, we provide means of guaranteeing tail throughput and response times in the face of performance variability of VMs, using Markov chain modeling and large deviation theory. We employ methods from analytical modeling, event-driven simulations and experiments. Overall, this thesis provides not only a multi-faceted approach to exploring several crucial aspects of hosting services in clouds, i.e., cost, tail throughput, and tail response times, but our proposed resource management strategies are also rigorously validated via trace-driven simulation and extensive experiment

    Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

    Get PDF
    Maintaining the health of IT infrastructure components for improved reliability and availability is a research and innovation topic for many years. Identification and handling of failures are crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information to diagnose and fix failures. In this work, we address three essential research dimensions about failures, such as the need for failure handling in IT infrastructure, understanding the contribution of system-generated log in failure detection and reactive & proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of existing literature by considering three prominent aspects as log preprocessing, anomaly & failure detection, and failure prevention. With this coherent review, we (1) presume the need for IT infrastructure monitoring to avoid downtime, (2) examine the three types of approaches for anomaly and failure detection such as a rule-based, correlation method and classification, and (3) fabricate the recommendations for researchers on further research guidelines. As far as the authors\u27 knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of meta-analysis and comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection. This work will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem

    Event-driven Temporal Models for Explanations - ETeMoX: Explaining Reinforcement Learning

    Get PDF
    Modern software systems are increasingly expected to show higher degrees of autonomy and self-management to cope with uncertain and diverse situations. As a consequence, autonomous systems can exhibit unexpected and surprising behaviours. This is exacerbated due to the ubiquity and complexity of Artificial Intelligence (AI)-based systems. This is the case of Reinforcement Learning (RL), where autonomous agents learn through trial-and-error how to find good solutions to a problem. Thus, the underlying decision-making criteria may become opaque to users that interact with the system and who may require explanations about the system’s reasoning. Available work for eXplainable Reinforcement Learning (XRL) offers different trade-offs: e.g. for runtime explanations, the approaches are model-specific or can only analyse results after-the-fact. Different from these approaches, this paper aims to provide an online model-agnostic approach for XRL towards trustworthy and understandable AI. We present ETeMoX, an architecture based on temporal models to keep track of the decision-making processes of RL systems. In cases where the resources are limited (e.g. storage capacity or time to response), the architecture also integrates complex event processing, an event-driven approach, for detecting matches to event patterns that need to be stored, instead of keeping the entire history. The approach is applied to a mobile communications case study that uses RL for its decision-making. In order to test the generalisability of our approach, three variants of the underlying RL algorithms are used: Q-Learning, SARSA and DQN. The encouraging results show that using the proposed configurable architecture, RL developers are able to obtain explanations about the evolution of a metric, relationships between metrics, and were able to track situations of interest happening over time windows

    A Quality of Service Monitoring System for Service Level Agreement Verification

    Get PDF
    Service-level-agreement (SLA) monitoring measures network Quality-of-Service (QoS) parameters to evaluate whether the service performance complies with the SLAs. It is becoming increasingly important for both Internet service providers (ISPs) and their customers. However, the rapid expansion of the Internet makes SLA monitoring a challenging task. As an efficient method to reduce both complexity and overheads for QoS measurements, sampling techniques have been used in SLA monitoring systems. In this thesis, I conduct a comprehensive study of sampling methods for network QoS measurements. I develop an efficient sampling strategy, which makes the measurements less intrusive and more efficient, and I design a network performance monitoring software, which monitors such QoS parameters as packet delay, packet loss and jitter for SLA monitoring and verification. The thesis starts with a discussion on the characteristics of QoS metrics related to the design of the monitoring system and the challenges in monitoring these metrics. Major measurement methodologies for monitoring these metrics are introduced. Existing monitoring systems can be broadly classified into two categories: active and passive measurements. The advantages and disadvantages of both methodologies are discussed and an active measurement methodology is chosen to realise the monitoring system. Secondly, the thesis describes the most common sampling techniques, such as systematic sampling, Poisson sampling and stratified random sampling. Theoretical analysis is performed on the fundamental limits of sampling accuracy. Theoretical analysis is also conducted on the performance of the sampling techniques, which is validated using simulation with real traffic. Both theoretical analysis and simulation results show that the stratified random sampling with optimum allocation achieves the best performance, compared with the other sampling methods. However, stratified sampling with optimum allocation requires extra statistics from the parent traffic traces, which cannot be obtained in real applications. In order to overcome this shortcoming, a novel adaptive stratified sampling strategy is proposed, based on stratified sampling with optimum allocation. A least-mean-square (LMS) linear prediction algorithm is employed to predict the required statistics from the past observations. Simulation results show that the proposed adaptive stratified sampling method closely approaches the performance of the stratified sampling with optimum allocation. Finally, a detailed introduction to the SLA monitoring software design is presented. Measurement results are displayed which calibrate systematic error in the measurements. Measurements between various remote sites have demonstrated impressively good QoS provided by Australian ISPs for premium services

    A Quality of Service Monitoring System for Service Level Agreement Verification

    Get PDF
    Service-level-agreement (SLA) monitoring measures network Quality-of-Service (QoS) parameters to evaluate whether the service performance complies with the SLAs. It is becoming increasingly important for both Internet service providers (ISPs) and their customers. However, the rapid expansion of the Internet makes SLA monitoring a challenging task. As an efficient method to reduce both complexity and overheads for QoS measurements, sampling techniques have been used in SLA monitoring systems. In this thesis, I conduct a comprehensive study of sampling methods for network QoS measurements. I develop an efficient sampling strategy, which makes the measurements less intrusive and more efficient, and I design a network performance monitoring software, which monitors such QoS parameters as packet delay, packet loss and jitter for SLA monitoring and verification. The thesis starts with a discussion on the characteristics of QoS metrics related to the design of the monitoring system and the challenges in monitoring these metrics. Major measurement methodologies for monitoring these metrics are introduced. Existing monitoring systems can be broadly classified into two categories: active and passive measurements. The advantages and disadvantages of both methodologies are discussed and an active measurement methodology is chosen to realise the monitoring system. Secondly, the thesis describes the most common sampling techniques, such as systematic sampling, Poisson sampling and stratified random sampling. Theoretical analysis is performed on the fundamental limits of sampling accuracy. Theoretical analysis is also conducted on the performance of the sampling techniques, which is validated using simulation with real traffic. Both theoretical analysis and simulation results show that the stratified random sampling with optimum allocation achieves the best performance, compared with the other sampling methods. However, stratified sampling with optimum allocation requires extra statistics from the parent traffic traces, which cannot be obtained in real applications. In order to overcome this shortcoming, a novel adaptive stratified sampling strategy is proposed, based on stratified sampling with optimum allocation. A least-mean-square (LMS) linear prediction algorithm is employed to predict the required statistics from the past observations. Simulation results show that the proposed adaptive stratified sampling method closely approaches the performance of the stratified sampling with optimum allocation. Finally, a detailed introduction to the SLA monitoring software design is presented. Measurement results are displayed which calibrate systematic error in the measurements. Measurements between various remote sites have demonstrated impressively good QoS provided by Australian ISPs for premium services

    Coordinating Resource Use in Open Distributed Systems

    Get PDF
    In an open distributed system, computational resources are peer-owned, and distributed over time and space. The system is open to interactions with its environment, and the resources can dynamically join or leave the system, or can be discovered at runtime. This dynamicity leads to opportunities to carry out computations without statically owned resources, harnessing the collective compute power of the resources connected by the Internet. However, realizing this potential requires efficient and scalable resource discovery, coordination, and control, which present challenges in a dynamic, open environment. In this thesis, I present an approach to address these challenges by separating the functionality concerns of concurrent computations from those of coordinating their resource use, with the purpose of reducing programming complexity, and aiding development of correct, efficient, and resource-aware concurrent programs. As a first step towards effectively coordinating distributed resources, I developed DREAM, a Distributed Resource Estimation and Allocation Model, which enables computations to reason about future availability of resources. I then developed a fine-grained resource coordination scheme for distributed computations. The coordination scheme integrates DREAM-based resource reasoning into a distributed scheduler, for deciding and enforcing fine-grained resource-use schedules for distributed computations. To control the overhead caused by the coordination, a tuner is implemented which explicitly balances the overhead of the control mechanisms against the extent of control exercised. The effectiveness and performance of the resource coordination approach have been evaluated using a number of case studies. Experimental results show that the approach can effectively schedule computations for supporting various types of coordination objectives, such as ensuring Quality-of-Service, power-efficient execution, and dynamic load balancing. The overhead caused by the coordination mechanism is relatively modest, and adjustable through the tuner. In addition, the coordination mechanism does not add extra programming complexity to computations

    Contributions to Securing Software Updates in IoT

    Get PDF
    The Internet of Things (IoT) is a large network of connected devices. In IoT, devices can communicate with each other or back-end systems to transfer data or perform assigned tasks. Communication protocols used in IoT depend on target applications but usually require low bandwidth. On the other hand, IoT devices are constrained, having limited resources, including memory, power, and computational resources. Considering these limitations in IoT environments, it is difficult to implement best security practices. Consequently, network attacks can threaten devices or the data they transfer. Thus it is crucial to react quickly to emerging vulnerabilities. These vulnerabilities should be mitigated by firmware updates or other necessary updates securely. Since IoT devices usually connect to the network wirelessly, such updates can be performed Over-The-Air (OTA). This dissertation presents contributions to enable secure OTA software updates in IoT. In order to perform secure updates, vulnerabilities must first be identified and assessed. In this dissertation, first, we present our contribution to designing a maturity model for vulnerability handling. Next, we analyze and compare common communication protocols and security practices regarding energy consumption. Finally, we describe our designed lightweight protocol for OTA updates targeting constrained IoT devices. IoT devices and back-end systems often use incompatible protocols that are unable to interoperate securely. This dissertation also includes our contribution to designing a secure protocol translator for IoT. This translation is performed inside a Trusted Execution Environment (TEE) with TLS interception. This dissertation also contains our contribution to key management and key distribution in IoT networks. In performing secure software updates, the IoT devices can be grouped since the updates target a large number of devices. Thus, prior to deploying updates, a group key needs to be established among group members. In this dissertation, we present our designed secure group key establishment scheme. Symmetric key cryptography can help to save IoT device resources at the cost of increased key management complexity. This trade-off can be improved by integrating IoT networks with cloud computing and Software Defined Networking (SDN).In this dissertation, we use SDN in cloud networks to provision symmetric keys efficiently and securely. These pieces together help software developers and maintainers identify vulnerabilities, provision secret keys, and perform lightweight secure OTA updates. Furthermore, they help devices and systems with incompatible protocols to be able to interoperate

    Monitoring and Optimization of ATLAS Tier 2 Center GoeGrid

    Get PDF
    The demand on computational and storage resources is growing along with the amount of infor- mation that needs to be processed and preserved. In order to ease the provisioning of the digital services to the growing number of consumers, more and more distributed computing systems and platforms are actively developed and employed. The building block of the distributed computing infrastructure are single computing centers, similar to the Worldwide LHC Computing Grid, Tier 2 centre GoeGrid. The main motivation of this thesis was the optimization of GoeGrid perfor- mance by efficient monitoring. The goal has been achieved by means of the GoeGrid monitoring information analysis. The data analysis approach was based on the adaptive-network-based fuzzy inference system (ANFIS) and machine learning algorithm such as Linear Support Vector Machine (SVM). The main object of the research was the digital service, since availability, reliability and ser- viceability of the computing platform can be measured according to the constant and stable provisioning of the services. Due to the widely used concept of the service oriented architecture (SOA) for large computing facilities, in advance knowing of the service state as well as the quick and accurate detection of its disability allows to perform the proactive management of the com- puting facility. The proactive management is considered as a core component of the computing facility management automation concept, such as Autonomic Computing. Thus in time as well as in advance and accurate identification of the provided service status can be considered as a contribution to the computing facility management automation, which is directly related to the provisioning of the stable and reliable computing resources. Based on the case studies, performed using the GoeGrid monitoring data, consideration of the approaches as generalized methods for the accurate and fast identification and prediction of the service status is reasonable. Simplicity and low consumption of the computing resources allow to consider the methods in the scope of the Autonomic Computing component
    • …
    corecore