
    Handling uncertainty in cloud resource management using fuzzy Bayesian networks

    The success of cloud services depends critically on the effective management of virtualized resources. This paper aims to design and implement a decision support method that handles uncertainties in resource management from the cloud provider's perspective, manages the underlying complexity, automates resource provisioning and controls client-perceived quality of service. The paper presents a probabilistic decision-making module that relies on a fuzzy Bayesian network to determine the current status of a cloud infrastructure, including physical and virtual machines, and to predict its near-future state, which helps the hypervisor migrate or expand VMs to reduce execution time and meet quality-of-service requirements. First, the resource management framework is presented. Second, the decision-making module is developed. Lastly, a series of experiments is implemented to investigate the performance of the proposed module. The experiments show the efficiency of the module prototype.
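
    A minimal sketch (in Python) of the kind of inference such a module performs, assuming invented membership breakpoints, node names and probability values rather than the paper's actual fuzzy Bayesian network:

```python
# Sketch only: fuzzify a utilization metric and push the membership degrees
# through a small conditional probability table to score an "overloaded soon"
# state. Breakpoints and CPT values are illustrative assumptions.

def fuzzify_cpu(util):
    """Triangular memberships for CPU utilization in [0, 1]."""
    low = max(0.0, min(1.0, (0.5 - util) / 0.5))
    high = max(0.0, min(1.0, (util - 0.5) / 0.5))
    medium = max(0.0, 1.0 - low - high)
    return {"low": low, "medium": medium, "high": high}

# P(overloaded_next | cpu_level): assumed values for illustration.
P_OVERLOAD_GIVEN_CPU = {"low": 0.05, "medium": 0.30, "high": 0.85}

def p_overloaded_next(util):
    """Marginalize the fuzzy evidence over the CPU node."""
    memberships = fuzzify_cpu(util)
    total = sum(memberships.values())
    return sum(m / total * P_OVERLOAD_GIVEN_CPU[level]
               for level, m in memberships.items())

if __name__ == "__main__":
    for u in (0.2, 0.6, 0.9):
        print(f"cpu={u:.1f} -> P(overloaded next)={p_overloaded_next(u):.2f}")
```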

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise open research problems for automatic parameter tuning.
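
    As a rough illustration of the experiment-driven category surveyed above, the sketch below randomly samples Spark-style configurations and keeps the fastest one; run_benchmark() is a hypothetical stand-in for an actual job-submission and timing harness:

```python
# Hedged sketch of experiment-driven tuning: sample configurations, measure,
# keep the best. The parameter names mirror real Spark settings, but the
# benchmark is a placeholder you would replace with real job runs.
import random

SEARCH_SPACE = {
    "spark.executor.memory": ["2g", "4g", "8g"],
    "spark.executor.cores": [1, 2, 4],
    "spark.sql.shuffle.partitions": [50, 200, 800],
    "spark.io.compression.codec": ["lz4", "snappy", "zstd"],
}

def run_benchmark(config):
    """Placeholder: submit the job with `config` and return its runtime (s)."""
    return random.uniform(100, 300)  # replace with a real measurement

def random_search(trials=20, seed=0):
    random.seed(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        t = run_benchmark(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

if __name__ == "__main__":
    cfg, t = random_search()
    print(f"best runtime {t:.1f}s with {cfg}")
```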

    SLA violation prediction: a machine learning perspective

    Cloud computing reduces the maintenance costs of services and allows users to access services on demand without being involved in technical implementation details. The relationship between a cloud provider and a customer is governed by a Service Level Agreement (SLA) that defines the level of the service and its associated costs. An SLA usually contains specific parameters and a minimum level of quality for each element of the service, negotiated between the cloud provider and the customer. However, one or more of the agreed terms in an SLA might be violated due to issues such as occasional technical problems. Violations do happen in the real world: in terms of availability, Amazon Elastic Cloud faced an outage in 2011 when it crashed and many large customers such as Reddit and Quora were down for more than one day. As SLA violation prediction benefits both the user and the cloud provider, in recent years cloud researchers have started investigating models that are capable of predicting future violations. From a machine learning point of view, the problem of SLA violation prediction amounts to a binary classification problem. In this thesis, we explore two machine learning classification models, Naive Bayes and Random Forest, to predict future violations using features of a submitted task. Unlike previous works on SLA violation prediction or avoidance, our models are trained on a real-world dataset, which introduces new challenges. We validate our models using the Google Cloud Cluster trace as the dataset. Since SLA violations are rare events in the real world (2.2%), the classification task becomes more challenging because the classifier will always tend to predict the dominant class. To overcome this issue, we use several re-sampling methods such as Random Over-Sampling, Under-Sampling, SMOTE, NearMiss, One-sided Selection and Neighborhood Cleaning Rule, as well as an ensemble of them, to re-balance the dataset.
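
    A hedged sketch of this kind of pipeline, using scikit-learn and imbalanced-learn on synthetic data in place of the Google Cloud Cluster trace (the features, class ratio and hyperparameters are assumptions, not the thesis settings):

```python
# Rough sketch: a synthetic, heavily imbalanced dataset (~2% positives) stands
# in for trace-derived task features; SMOTE re-balances the training split and
# a Random Forest predicts violations.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stand-in for task features from the trace (CPU request, memory request, ...).
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.98],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Re-balance only the training data so the test set keeps the true class ratio.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```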

    Assuring virtual network reliability and resilience

    A framework is developed that uses reliability block diagrams and continuous-time Markov chains to model and analyse the reliability and availability of a Virtual Network Environment (VNE). In addition, to minimise unpredicted failures and reduce the impact of failure on a virtual network, a dynamic solution is proposed for detecting a failure before it occurs in the VNE. Moreover, to predict failures and establish a tolerable maintenance plan before a failure occurs, a failure prediction method for the VNE can be used to minimise unpredicted failures, reduce backup redundancy and maximise system performance.
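
    The sketch below illustrates the two modelling ingredients named above with assumed failure and repair rates: steady-state availability from a two-state continuous-time Markov chain, and a series reliability block diagram. It is a toy calculation, not the framework's actual model:

```python
# Two-state CTMC: a component alternates between "up" (fails at rate lam) and
# "down" (repaired at rate mu); its steady-state availability is mu/(lam+mu).
# A series RBD is available only if every block is available.

def ctmc_availability(lam, mu):
    """Steady-state availability of an up/down CTMC."""
    return mu / (lam + mu)

def series_rbd(availabilities):
    """Multiply component availabilities for a series block diagram."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

if __name__ == "__main__":
    # Assumed rates (per hour) for a physical host, a hypervisor, and a VM.
    host = ctmc_availability(lam=1 / 2000, mu=1 / 4)
    hyp = ctmc_availability(lam=1 / 5000, mu=1 / 2)
    vm = ctmc_availability(lam=1 / 1000, mu=1 / 1)
    print(f"end-to-end availability: {series_rbd([host, hyp, vm]):.5f}")
```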

    Learning from accidents: machine learning for safety at railway stations

    In railway systems, station safety is a critical aspect of the overall structure, and yet accidents at stations still occur. It is time to learn from these errors and improve conventional methods by utilizing the latest technology, such as machine learning (ML), to analyse accidents and enhance safety systems. ML has been employed in many fields, including engineering systems, and it interacts with us throughout our daily lives. Thus, we must consider the available technology in general, and ML in particular, in the context of safety in the railway industry. This paper explores the use of the decision tree (DT) method for safety classification and the analysis of accidents at railway stations to predict the traits of passengers affected by accidents. The critical contribution of this study is the presentation of ML and an explanation of how this technique is applied to ensure safety, utilize automated processes, and gain benefits from this powerful technology. To apply and explore this method, a case study has been selected that focuses on the fatalities caused by accidents at railway stations. An analysis of some of these fatal accidents as reported by the Rail Safety and Standards Board (RSSB) is performed and presented in this paper to provide a broader summary of the application of supervised ML for improving safety at railway stations. Finally, this research shows the vast potential of the innovative application of ML in safety analysis for the railway industry.
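
    A small, purely illustrative sketch of the decision tree workflow described above; the features and toy records are invented, whereas the study itself works from RSSB accident reports:

```python
# Toy decision tree over hand-made accident records to show the DT workflow:
# one-hot encode categorical attributes, fit a shallow tree, print its rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age_group":   ["adult", "elderly", "adult", "child", "elderly", "adult"],
    "time_of_day": ["peak", "off_peak", "off_peak", "peak", "peak", "off_peak"],
    "location":    ["platform_edge", "stairs", "platform_edge",
                    "concourse", "platform_edge", "stairs"],
    "fatal":       [1, 0, 1, 0, 1, 0],
})

X = pd.get_dummies(data.drop(columns="fatal"))
y = data["fatal"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```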

    Anomaly Detection in Cloud-Native systems

    In recent years, microservices have gained popularity due to benefits such as increased maintainability and scalability. The microservice architectural pattern was adopted for the development of a large-scale system that is commonly deployed on public and private clouds, and the aim is therefore to ensure that it always maintains an optimal level of performance. Consequently, the system is monitored by collecting different metrics, including performance-related ones. The first part of this thesis focuses on the creation of a dataset of realistic time series with anomalies at deterministic locations. This dataset addresses the lack of labeled data for training supervised models and the absence of publicly available data, which is usually not shared due to privacy concerns. The second part consists of an empirical study on the detection of anomalies occurring in the different services that compose the system. Specifically, the aim is to understand whether anomalies can be predicted so that actions can be taken before system failures or performance degradation. Consequently, eight different classification-based machine learning algorithms were compared by collecting accuracy, training time and testing time, to determine which technique might be most suitable for reducing system overload. The results showed that there are strong correlations between metrics and that anomalies in the system can be predicted with approximately 90% accuracy. The most important outcome is that performance-related anomalies can be detected by monitoring a limited number of metrics collected at runtime, with a short training time. Future work includes the adoption of prediction-based approaches and the development of tools for the prediction of anomalies in cloud-native environments.
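
    A hedged sketch of such a comparison, timing training and testing for a few scikit-learn classifiers on synthetic data; the features and the model list are assumptions, while the thesis compares eight algorithms on its own labelled time series:

```python
# Compare classifiers on synthetic "metrics" data, recording accuracy,
# training time and testing time for each model.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "RandomForest": RandomForestClassifier(random_state=1),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
}

for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    t1 = time.perf_counter()
    acc = model.score(X_te, y_te)
    t2 = time.perf_counter()
    print(f"{name:20s} acc={acc:.3f} train={t1 - t0:.2f}s test={t2 - t1:.2f}s")
```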

    Data center's telemetry reduction and prediction through modeling techniques

    Nowadays, Cloud Computing is widely used to host and deliver services over the Internet. The architecture of clouds is complex due to the heterogeneous nature of their hardware, and they are hosted in large-scale data centers. To effectively and efficiently manage such complex infrastructure, constant monitoring is needed. This monitoring generates large amounts of telemetry data streams (e.g. hardware utilization metrics) which are used for multiple purposes, including problem detection, resource management, workload characterization, resource utilization prediction, capacity planning, and job scheduling. These telemetry streams require costly bandwidth and storage space, particularly in the medium to long term for large data centers. Moreover, accurate future estimation of these telemetry streams is a challenging task due to multi-tenant co-hosted applications and dynamic workloads. Inaccurate estimation leads to either under- or over-provisioning of data center resources. In this Ph.D. thesis, we propose to improve the prediction accuracy and reduce the bandwidth utilization and storage space requirements with the help of modeling and prediction methods from machine learning. Most of the existing methods are based on a single model, which often does not appropriately estimate different workload scenarios. Moreover, these prediction methods use a fixed size of observation window, which cannot produce accurate results because it is not adaptively adjusted to capture the local trends in the recent data. Therefore, estimation methods trained on fixed sliding windows use an irrelevantly large number of observations, which yields inaccurate estimations. In summary, we (C1) efficiently reduce bandwidth and storage for telemetry data through real-time modeling using a Markov chain model; (C2) propose a novel method to adaptively and automatically identify the most appropriate model to accurately estimate data center resource utilization; and (C3) propose a deep learning-based adaptive window size selection method which dynamically limits the sliding window size to capture the local trend in the latest resource utilization for building the estimation model.
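
    A minimal sketch of the Markov-chain idea behind (C1), assuming invented bin edges and a synthetic utilization stream: quantize the telemetry into a few levels and keep only a transition matrix as the compact representation:

```python
# Quantize a CPU utilization stream into a few bins, build a transition matrix
# from the bin sequence, and use it to predict the most likely next level.
# The matrix (a handful of numbers) stands in for the raw sample stream.
import numpy as np

BINS = np.array([0.25, 0.5, 0.75])            # 4 utilization levels

def to_states(stream):
    return np.digitize(stream, BINS)          # map samples to bin indices 0..3

def transition_matrix(states, n_states=4):
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cpu = np.clip(0.5 + np.cumsum(rng.normal(0, 0.05, 500)), 0, 1)
    P = transition_matrix(to_states(cpu))
    print("most likely next level from each current level:", P.argmax(axis=1))
```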