2,144 research outputs found

    Online Fault Classification in HPC Systems through Machine Learning

    As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. We have based our study on a local dataset, which we make publicly available, that was acquired by injecting faults into an in-house experimental HPC system. Comment: Accepted for publication at the Euro-Par 2019 conference

    ALBADross: active learning based anomaly diagnosis for production HPC systems

    2263712 - Sandia National Laboratories. Accepted manuscript

    Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management

    Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units. This thesis claims that to improve robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than what is available in today's systems are needed, and this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management. We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead. We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations. 
This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to 21% improvement in throughput per Watt compared to the state of the art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies. Date: 2019-07-02

    A Raspberry Pi-based Traumatic Brain Injury Detection System for Single-Channel Electroencephalogram

    Traumatic Brain Injury (TBI) is a common cause of death and disability. However, existing tools for TBI diagnosis are either subjective or require extensive clinical setup and expertise. The increasing affordability and reduction in size of relatively high-performance computing systems, combined with promising results from TBI-related machine learning research, make it possible to create compact and portable systems for early detection of TBI. This work describes a Raspberry Pi-based portable, real-time data acquisition and automated processing system that uses machine learning to efficiently identify TBI and automatically score sleep stages from a single-channel Electroencephalogram (EEG) signal. We discuss the design, implementation, and verification of the system, which can digitize an EEG signal using an Analog to Digital Converter (ADC) and perform real-time signal classification to detect the presence of mild TBI (mTBI). We utilize Convolutional Neural Networks (CNN) and XGBoost-based predictive models to evaluate the performance and demonstrate the versatility of the system to operate with multiple types of predictive models. We achieve a peak classification accuracy of more than 90% with a classification time of less than 1 s across 16 s - 64 s epochs for TBI vs control conditions. This work can enable development of systems suitable for field use without requiring specialized medical equipment for early TBI detection applications and TBI research. Further, this work opens avenues to implement connected, real-time TBI-related health and wellness monitoring systems. Comment: 12 pages, 6 figures

    Vehicle level health assessment through integrated operational scalable prognostic reasoners

    Today’s aircraft are very complex in design and need constant monitoring of their systems to establish the overall health status. Integrated Vehicle Health Management (IVHM) is a major component of a new asset management paradigm in which a conscious effort is made to shift asset maintenance from a schedule-based approach to a more proactive and predictive approach. Its goal is to maximize asset operational availability while minimising downtime and the logistics footprint by monitoring the deterioration of component conditions. IVHM involves data processing that comprises capturing data related to assets, monitoring parameters, assessing current or future health conditions through a prognostics and diagnostics engine, and providing recommended maintenance actions. Data-driven prognostics methods usually use a large amount of data to learn the degradation pattern (nominal model) and predict future health. The run-to-failure data used are usually accelerated data produced in lab environments, which hardly reflect real-life conditions. The nominal model is therefore far from the present condition of the vehicle, and the predictions will not be very accurate. The prediction model will try to follow the nominal model, which means more errors in the prediction; this is a major drawback of data-driven techniques. This research primarily presents two novel techniques of adaptive data-driven prognostics to capture the vehicle operational scalability degradation. Secondly, the degradation information has been used as a health index and in the Vehicle Level Reasoning System (VLRS). Novel VLRS are also presented in this research study. The research described here proposes a condition adaptive prognostics reasoning along with VLRS