Online Fault Classification in HPC Systems through Machine Learning
As High-Performance Computing (HPC) systems strive towards the exascale goal,
studies suggest that they will experience excessive failure rates. For this
reason, detecting and classifying faults in HPC systems as they occur and
initiating corrective actions before they can transform into failures will be
essential for continued operation. In this paper, we propose a fault
classification method for HPC systems based on machine learning that has been
designed specifically to operate with live streamed data. We cast the problem
and its solution within realistic operating constraints of online use. Our
results show that almost perfect classification accuracy can be reached for
different fault types with low computational overhead and minimal delay. We
have based our study on a local dataset, which we make publicly available, that
was acquired by injecting faults into an in-house experimental HPC system.
Comment: Accepted for publication at the Euro-Par 2019 conference
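The abstract does not specify the model used, but the online setting it describes (classifying faults from live streamed metrics with low overhead) can be sketched with a toy incremental classifier that updates running per-class centroids one sample at a time. All class names and metric features below are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

class OnlineCentroidClassifier:
    """Toy streaming classifier: keeps a running mean feature vector per
    fault class and labels new samples by nearest centroid. Illustrative
    only; the paper's actual model and features are not specified here."""
    def __init__(self):
        self.sums = defaultdict(lambda: None)   # per-class feature sums
        self.counts = defaultdict(int)          # per-class sample counts

    def partial_fit(self, x, label):
        # Incremental update: no retraining pass over historical data.
        s = self.sums[label]
        self.sums[label] = [a + b for a, b in zip(s, x)] if s else list(x)
        self.counts[label] += 1

    def predict(self, x):
        def sq_dist(label):
            centroid = [v / self.counts[label] for v in self.sums[label]]
            return sum((a - b) ** 2 for a, b in zip(centroid, x))
        return min(self.sums, key=sq_dist)

clf = OnlineCentroidClassifier()
# Simulated per-node metric windows: (cpu_util, mem_util) -> fault label
stream = [((0.9, 0.2), "cpu_hog"), ((0.1, 0.95), "mem_leak"),
          ((0.85, 0.25), "cpu_hog"), ((0.15, 0.9), "mem_leak")]
for x, y in stream:
    clf.partial_fit(x, y)
print(clf.predict((0.88, 0.3)))  # -> cpu_hog
```

Because each update is O(features), this kind of model meets the "low computational overhead and minimal delay" constraint the abstract emphasises, at the cost of the richer decision boundaries a full offline model could learn.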
ALBADross: active learning based anomaly diagnosis for production HPC systems
Sandia National Laboratories. Accepted manuscript.
Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management
Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units.
This thesis claims that to improve robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than what is available in today's systems are needed, and this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management.
We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead.
We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations.
This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to 21% improvement in throughput per Watt compared to the state-of-the-art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies.
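The hop-bytes metric the thesis uses to evaluate placement quality is simple to state: for each communicating pair of tasks, multiply the traffic volume by the number of network hops between the nodes they are placed on, and sum. A minimal sketch, with made-up node names, hop distances, and traffic volumes (none taken from the thesis):

```python
def hop_bytes(traffic, placement, hops):
    """Hop-bytes = sum over communicating task pairs of
    (bytes exchanged) * (hops between their assigned nodes).
    Lower is better; placement policies try to minimise it.
    All identifiers and values here are illustrative."""
    return sum(vol * hops[placement[a]][placement[b]]
               for (a, b), vol in traffic.items())

# Toy topology: pairwise hop counts between three nodes
hops = {"n0": {"n0": 0, "n1": 1, "n2": 3},
        "n1": {"n0": 1, "n1": 0, "n2": 2},
        "n2": {"n0": 3, "n1": 2, "n2": 0}}
traffic = {("t0", "t1"): 100}  # two tasks exchanging 100 MB

near = hop_bytes(traffic, {"t0": "n0", "t1": "n1"}, hops)  # 100 * 1
far  = hop_bytes(traffic, {"t0": "n0", "t1": "n2"}, hops)  # 100 * 3
print(near, far)  # -> 100 300
```

A communication-aware placement policy of the kind the abstract describes would search over candidate placements to minimise this sum, preferring to co-locate heavily communicating tasks on topologically close nodes.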
A Raspberry Pi-based Traumatic Brain Injury Detection System for Single-Channel Electroencephalogram
Traumatic Brain Injury (TBI) is a common cause of death and disability.
However, existing tools for TBI diagnosis are either subjective or require
extensive clinical setup and expertise. The increasing affordability and
reduction in size of relatively high-performance computing systems combined
with promising results from TBI related machine learning research make it
possible to create compact and portable systems for early detection of TBI.
This work describes a Raspberry Pi based portable, real-time data acquisition,
and automated processing system that uses machine learning to efficiently
identify TBI and automatically score sleep stages from a single-channel
Electroencephalogram (EEG) signal. We discuss the design, implementation, and
verification of the system that can digitize EEG signal using an Analog to
Digital Converter (ADC) and perform real-time signal classification to detect
the presence of mild TBI (mTBI). We utilize Convolutional Neural Networks (CNN)
and XGBoost based predictive models to evaluate the performance and demonstrate
the versatility of the system to operate with multiple types of predictive
models. We achieve a peak classification accuracy of more than 90% with a
classification time of less than 1 s across 16 s - 64 s epochs for TBI vs
control conditions. This work can enable development of systems suitable for
field use without requiring specialized medical equipment for early TBI
detection applications and TBI research. Further, this work opens avenues to
implement connected, real-time TBI related health and wellness monitoring
systems.
Comment: 12 pages, 6 figures
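The abstract reports classification over 16 s to 64 s epochs of a single-channel signal; the epoching step itself is straightforward and can be sketched as below. The sampling rate is an assumption for illustration, not a value stated in the abstract.

```python
def epoch(signal, fs, epoch_s):
    """Split a single-channel EEG stream into fixed-length,
    non-overlapping epochs, each of which the system would
    classify independently. fs = sampling rate in Hz (assumed)."""
    n = fs * epoch_s
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

fs = 256                      # assumed ADC sampling rate, not from the paper
signal = [0.0] * (fs * 64)    # 64 s of dummy samples
epochs = epoch(signal, fs, 16)
print(len(epochs), len(epochs[0]))  # -> 4 4096
```

Each epoch would then be turned into features (or fed raw to a CNN) before the CNN or XGBoost model predicts TBI vs. control; the sub-second classification time the abstract reports is per epoch.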
Vehicle level health assessment through integrated operational scalable prognostic reasoners
Today’s aircraft are very complex in design and need constant monitoring of
their systems to establish the overall health status. Integrated Vehicle Health
Management (IVHM) is a major component in a new future asset management
paradigm where a conscious effort is made to shift asset maintenance from a
schedule-based approach to a more proactive, predictive approach. Its goal is
to maximise asset operational availability while minimising downtime and the
logistics footprint through monitoring deterioration of component conditions.
IVHM involves data processing which comprehensively consists of capturing data
related to assets, monitoring parameters, assessing current or future health
conditions through a prognostics and diagnostics engine, and providing
recommended maintenance actions.
Data-driven prognostics methods typically use a large amount of data to learn
the degradation pattern (a nominal model) and predict future health. However,
the run-to-failure data used for training are usually accelerated data produced
in laboratory environments, which rarely match real-life operation. The nominal
model is therefore far from the present condition of the vehicle, so the
predictions will not be very accurate. Because the prediction model tries to
follow the nominal model, it accumulates further prediction error; this is a
major drawback of data-driven techniques.
This research primarily presents two novel techniques of adaptive data-driven
prognostics to capture vehicle operational scalability degradation. Secondly,
the degradation information is used as a health index and within a Vehicle
Level Reasoning System (VLRS). Novel VLRS designs are also presented in this
research study. The research described here proposes condition-adaptive
prognostics reasoning along with a VLRS.
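The thesis's adaptive techniques are not detailed in the abstract, but the core idea of a health index feeding a prognostic prediction can be illustrated with the simplest possible stand-in: fit a straight line to observed health-index samples and extrapolate to the failure threshold. All numbers and the linear-degradation assumption below are illustrative, not from the thesis.

```python
def predict_failure_time(times, health, threshold):
    """Least-squares line fit to health-index samples, extrapolated
    to when the index crosses the failure threshold. A minimal
    stand-in for the adaptive prognostics described above."""
    n = len(times)
    mt, mh = sum(times) / n, sum(health) / n
    slope = (sum((t - mt) * (h - mh) for t, h in zip(times, health))
             / sum((t - mt) ** 2 for t in times))
    intercept = mh - slope * mt
    return (threshold - intercept) / slope  # time when health == threshold

# Health index degrading linearly from 1.0 toward a threshold of 0.2
times = [0, 10, 20, 30]
health = [1.0, 0.9, 0.8, 0.7]
print(predict_failure_time(times, health, 0.2))  # -> 80.0
```

The gap the thesis targets is precisely that a fixed nominal model like this one ignores the vehicle's actual operating condition; an adaptive scheme would re-estimate the degradation trend as new in-service data arrive rather than trusting a lab-derived curve.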