315 research outputs found

    Online Fault Classification in HPC Systems through Machine Learning

    Full text link
    As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. We have based our study on a local dataset, which we make publicly available, that was acquired by injecting faults to an in-house experimental HPC system.Comment: Accepted for publication at the Euro-Par 2019 conferenc

    Detection of Malware Attacks on Virtual Machines for a Self-Heal Approach in Cloud Computing using VM Snapshots

    Get PDF
    Cloud Computing strives to be dynamic as a service oriented architecture. The services in the SoA are rendered in terms of private, public and in many other commercial domain aspects. These services should be secured and thus are very vital to the cloud infrastructure. In order, to secure and maintain resilience in the cloud, it not only has to have the ability to identify the known threats but also to new challenges that target the infrastructure of a cloud. In this paper, we introduce and discuss a detection method of malwares from the VM logs and corresponding VM snapshots are classified into attacked and non-attacked VM snapshots. As snapshots are always taken to be a backup in the backup servers, especially during the night hours, this approach could reduce the overhead of the backup server with a self-healing capability of the VMs in the local cloud infrastructure. A machine learning approach at the hypervisor level is projected, the features being gathered from the API calls of VM instances in the IaaS level of cloud service. Our proposed scheme can have a high detection accuracy of about 93% while having the capability to classify and detect different types of malwares with respect to the VM snapshots. Finally the paper exhibits an algorithm using snapshots to detect and thus to self-heal using the monitoring components of a particular VM instances applied to cloud scenarios. The self-healing approach with machine learning algorithms can determine new threats with some prior knowledge of its functionality

    Robust Learning Enabled Intelligence for the Internet-of-Things: A Survey From the Perspectives of Noisy Data and Adversarial Examples

    Get PDF
    This is the author accepted manuscript. The final version is available from IEEE via the DOI in this recordThe Internet-of-Things (IoT) has been widely adopted in a range of verticals, e.g., automation, health, energy and manufacturing. Many of the applications in these sectors, such as self-driving cars and remote surgery, are critical and high stakes applications, calling for advanced machine learning (ML) models for data analytics. Essentially, the training and testing data that are collected by massive IoT devices may contain noise (e.g., abnormal data, incorrect labels and incomplete information) and adversarial examples. This requires high robustness of ML models to make reliable decisions for IoT applications. The research of robust ML has received tremendous attentions from both academia and industry in recent years. This paper will investigate the state-of-the-art and representative works of robust ML models that can enable high resilience and reliability of IoT intelligence. Two aspects of robustness will be focused on, i.e., when the training data of ML models contains noises and adversarial examples, which may typically happen in many real-world IoT scenarios. In addition, the reliability of both neural networks and reinforcement learning framework will be investigated. Both of these two machine learning paradigms have been widely used in handling data in IoT scenarios. The potential research challenges and open issues will be discussed to provide future research directions.Engineering and Physical Sciences Research Council (EPSRC

    Monitoring and analysis system for performance troubleshooting in data centers

    Get PDF
    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM, with an mistaken deletion of the state data of Amazon Elastic Load Balancing Service (ELB for short), which was not realized at that time. The mistake first led to a local issue that a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one that EC2 customers were significantly affected. One example was that Netflix, which was using hundreds of Amazon ELB services, was experiencing an extensive streaming service outage when many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought the performance troubleshooting in data centers to world’s attention. As shown in this Amazon ELB case.Troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application to system/platform level metrics, and a variety of representative analytics functions. When supporting algorithms with high computation complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average.Ph.D

    DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters

    Full text link
    When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest, publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss, or even worse, reliability degradation of a datacenter. We further propose a two-stage framework-DC-Prophet-based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and a F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.Comment: 13 pages, 5 figures, accepted by 2017 ECML PKD

    Detection of Malware Attacks on Virtual Machines for a Self-Heal Approach in Cloud Computing using VM Snapshots

    Get PDF
    Cloud Computing strives to be dynamic as a service oriented architecture. The services in the SoA are rendered in terms of private, public and in many other commercial domain aspects. These services should be secured and thus are very vital to the cloud infrastructure. In order, to secure and maintain resilience in the cloud, it not only has to have the ability to identify the known threats but also to new challenges that target the infrastructure of a cloud. In this paper, we introduce and discuss a detection method of malwares from the VM logs and corresponding VM snapshots are classified into attacked and non-attacked VM snapshots. As snapshots are always taken to be a backup in the backup servers, especially during the night hours, this approach could reduce the overhead of the backup server with a self-healing capability of the VMs in the local cloud infrastructure. A machine learning approach at the hypervisor level is projected, the features being gathered from the API calls of VM instances in the IaaS level of cloud service. Our proposed scheme can have a high detection accuracy of about 93% while having the capability to classify and detect different types of malwares with respect to the VM snapshots. Finally the paper exhibits an algorithm using snapshots to detect and thus to self-heal using the monitoring components of a particular VM instances applied to cloud scenarios. The self-healing approach with machine learning algorithms can determine new threats with some prior knowledge of its functionality

    Cloud Computing Technology in the Development of Digital Libraries to Increase Literacy in Indonesia

    Get PDF
    Problem Statement: Cloud computing has the potential to increase literacy in Indonesia by enabling wider and more efficient access to sources of information through the creation of digital libraries that can be accessed anywhere and anytime. Purpose: In this analysis we can understand how Indonesian digital libraries are doing right now and outlines the difficulties they confront in giving their users access to information and resources. Method: This research is a descriptive qualitative research. Researchers chose to use this method because this research focuses on complex social phenomena, namely increasing literacy through the use of digital libraries supported by cloud processing technology. Result: The advantages of cloud computing technology for digital libraries, such as improved accessibility, improved cooperation, and increased efficiency, are also covered in the article. Conclusion: At the article's suggestions are made for using and implementing cloud computing technologies to increase literacy rates in Indonesia.    Keywords: Literacy; Digital  Libraries; Cloud Computing   Abstrak Permasalahan: Komputasi awan memiliki potensi untuk meningkatkan literasi di Indonesia dengan memungkinkan akses yang lebih luas dan efisien ke sumber informasi melalui pembuatan perpustakaan digital yang dapat diakses di mana saja dan kapan saja. Tujuan: Dalam analisis ini kita dapat memahami bagaimana perpustakaan digital Indonesia saat ini dan menguraikan kesulitan yang mereka hadapi dalam memberikan akses informasi dan sumber daya kepada penggunanya. Metode: Penelitian ini merupakan penelitian kualitatif deskriptif. Peneliti memilih menggunakan metode ini karena penelitian ini berfokus pada fenomena sosial yang kompleks, yaitu peningkatan literasi melalui penggunaan perpustakaan digital yang didukung oleh teknologi pemrosesan awan. Hasil: Keuntungan teknologi komputasi awan untuk perpustakaan digital ini seperti, peningkatan aksesibilitas, peningkatan kerja sama, dan peningkatan efisiensi. Kesimpulan: Pada artikel tersebut diusulkan penggunaan dan implementasi teknologi komputasi awan untuk meningkatkan angka melek huruf di Indonesia.  Kata kunci: Literasi; Perpustakaan Digital; Komputasi Awa
    corecore