Continuous Learning of HPC Infrastructure Models using Big Data Analytics and In-Memory Processing Tools
This work was supported, in part, by the FP7 ERC Advanced project MULTITHERMAN (g.a. 291125), by the EU H2020 FETHPC project ANTAREX (g.a. 671623) and by the EU H2020 FETHPC project ExaNoDe (g.a. 671578). Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges, such as energy consumption, equipment cooling, reliability and massive parallelism. Model-based optimization is an essential tool in the design process and control of energy-efficient, reliable and thermally constrained systems. However, in the exascale domain, model-learning techniques tailored to the specific supercomputer require real measurements and must therefore handle and analyze the massive amount of data coming from the HPC monitoring infrastructure. This rapidly becomes a 'big data'-scale problem. The common approach, in which measurements are first stored in large databases and then processed, is no longer affordable due to the increasing storage costs and the lack of real-time support. Instead, cloud-based machine learning techniques nowadays aim to build on-line models using real-time approaches such as 'stream processing' and 'in-memory' computing, which avoid storage costs and enable fast data processing. Moreover, the fast delivery of the models and their adaptation to quick data variations make the decision stage of the optimization loop more effective and reliable. In this paper we leverage scalable, lightweight and flexible IoT technologies, such as the MQTT protocol, to build a highly scalable HPC monitoring infrastructure able to handle the massive sensor data produced by next-gen HPC components.
We then show how state-of-the-art tools for big data computing and analysis, such as Apache Spark, can be used to manage the huge amount of data delivered by the monitoring layer and to build adaptive models in real-time using on-line machine learning techniques.
Beneventi, Francesco; Bartolini, Andrea; Cavazzoni, Carlo; Benini, Luca
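The core idea above is that each streamed measurement updates the model immediately and is then discarded, so no raw data needs to be stored. A minimal stdlib sketch of that on-line learning step, with the paper's MQTT transport and Spark engine replaced by a plain Python generator (the thermal model, its coefficients, and the sensor data are invented for illustration):

```python
# Sketch of on-line model learning over a sensor stream: every sample
# updates the model in place and is then dropped, avoiding storage.
# (Hypothetical data; not the paper's actual pipeline or model.)

def sensor_stream():
    """Hypothetical stream of ((cpu_load, fan_speed), temperature)."""
    samples = [
        ((0.2, 0.5), 41.0),
        ((0.8, 0.5), 59.0),
        ((0.5, 0.9), 46.0),
        ((0.9, 0.2), 65.0),
    ]
    for _ in range(2000):
        yield from samples

def online_fit(stream, lr=0.1):
    """Per-sample SGD for a linear thermal model T = w[0]*x0 + w[1]*x1 + b."""
    w, b = [0.0, 0.0], 0.0
    for x, t in stream:
        err = w[0] * x[0] + w[1] * x[1] + b - t   # prediction error
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err
    return w, b

w, b = online_fit(sensor_stream())
```

Because the update cost per sample is constant, the same loop works whether the stream delivers ten samples or ten billion, which is what makes this style of modeling viable at "big data" monitoring rates.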
Online Fault Classification in HPC Systems through Machine Learning
As High-Performance Computing (HPC) systems strive towards the exascale goal,
studies suggest that they will experience excessive failure rates. For this
reason, detecting and classifying faults in HPC systems as they occur and
initiating corrective actions before they can transform into failures will be
essential for continued operation. In this paper, we propose a fault
classification method for HPC systems based on machine learning that has been
designed specifically to operate with live streamed data. We cast the problem
and its solution within realistic operating constraints of online use. Our
results show that almost perfect classification accuracy can be reached for
different fault types with low computational overhead and minimal delay. We
have based our study on a local dataset, which we make publicly available, acquired by injecting faults into an in-house experimental HPC system.
Comment: Accepted for publication at the Euro-Par 2019 conference.
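The classification step described above can be sketched with stdlib code: extract simple statistics from a sliding window of a monitored metric and label each window with a nearest-centroid classifier. The fault types, centroid values, and window width below are illustrative stand-ins, not the paper's actual feature set or model:

```python
# Sliding-window feature extraction + nearest-centroid labeling of a
# live metric stream (illustrative values throughout).
from collections import deque
from statistics import mean, stdev

def features(window):
    """Cheap per-window statistics usable at streaming rates."""
    return (mean(window), stdev(window))

# Centroids assumed to be learned offline from labeled windows.
CENTROIDS = {
    "healthy":  (50.0, 2.0),    # moderate load, low variance
    "cpu_hog":  (95.0, 1.0),    # pegged utilisation
    "unstable": (50.0, 20.0),   # oscillating metric
}

def classify(window):
    """Assign the window to its nearest centroid (squared distance)."""
    f = features(window)
    return min(CENTROIDS,
               key=lambda k: sum((a - b) ** 2 for a, b in zip(f, CENTROIDS[k])))

def stream_classify(stream, width=8):
    """Classify the live stream one sliding window at a time."""
    buf = deque(maxlen=width)
    for sample in stream:
        buf.append(sample)
        if len(buf) == width:
            yield classify(buf)
```

The fixed-size deque keeps memory and per-sample cost constant, which matches the paper's "online use within realistic operating constraints" framing: classification lag is bounded by the window width.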
Benefits in Relaxing the Power Capping Constraint
Work supported by the EU FETHPC project ANTAREX (g.a. 671623), the EU project ExaNoDe (g.a. 671578), and the EU ERC project MULTITHERMAN (g.a. 291125). In this manuscript we evaluate the impact of HW power capping mechanisms on a real parallel scientific application. By comparing the HW capping mechanism against static frequency allocation schemes, we show that a speed-up can be achieved if the power constraint is enforced on average over the application run, instead of over short time periods. RAPL, which enforces the power constraint on a time scale of a few milliseconds, fails to share the power budget between more demanding and less demanding application phases.
Cesarini, Daniele; Bartolini, Andrea; Benini, Luca
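The effect described above can be illustrated with a toy model: under a short-horizon (RAPL-like) cap every phase is clipped individually, while an average cap lets low-power phases donate headroom to power-hungry ones. All numbers, the linear power-to-speed scaling, and the two-phase application are assumptions for illustration only:

```python
# Toy comparison of per-phase vs run-average power capping.
# Crude model: speed scales linearly with the power actually granted.

CAP = 100.0  # W, the power budget (illustrative)

# (work units, power demanded at full speed) per phase -- hypothetical
phases = [(100, 140.0),   # compute-bound phase, exceeds the cap
          (100, 60.0)]    # memory-bound phase, well under the cap

def runtime_instant_cap():
    """Each phase individually clipped to CAP (RAPL-like behavior)."""
    t = 0.0
    for work, demand in phases:
        granted = min(demand, CAP)
        t += work / (granted / demand)  # slowdown factor = granted/demand
    return t

def runtime_average_cap():
    """Cap enforced on the run average: full speed if the mean power
    over the whole run stays within budget, uniform slowdown otherwise."""
    t = sum(work for work, _ in phases)  # full speed: 1 work unit / s
    avg_power = sum(work * p for work, p in phases) / t
    return t if avg_power <= CAP else t * (avg_power / CAP)
```

In this toy case the two phases average exactly to the budget, so the average cap runs at full speed (200 s) while the instantaneous cap stretches the compute-bound phase (240 s total), a 1.2x slowdown consistent with the paper's qualitative finding.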
Performance and Power Analysis of HPC Workloads on Heterogeneous Multi-Node Clusters
Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, enabling application optimizations. Due to the increasing interest of the High Performance Computing (HPC) community in energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques are applicable on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by the "5 per mille" contribution assigned to the University of Ferrara (2014 income tax returns). We thank the University of Ferrara and INFN Ferrara for access to the COKA cluster. We warmly thank the BSC tools group for supporting us in the smooth integration and testing of our setup within Extrae and Paraver. Peer reviewed. Postprint (published version).
DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems
As we approach the exascale era, the size and complexity of HPC systems
continues to increase, raising concerns about their manageability and
sustainability. For this reason, more and more HPC centers are experimenting
with fine-grained monitoring coupled with Operational Data Analytics (ODA) to
optimize efficiency and effectiveness of system operations. However, while
monitoring is a common reality in HPC, there is no well-stated and
comprehensive list of requirements, nor matching frameworks, to support
holistic and online ODA. This leads to insular ad-hoc solutions, each
addressing only specific aspects of the problem.
In this paper we propose Wintermute, a novel generic framework to enable
online ODA on large-scale HPC installations. Its design is based on the results
of a literature survey of common operational requirements. We implement
Wintermute on top of the holistic DCDB monitoring system, offering a large
variety of configuration options to accommodate the varying requirements of ODA
applications. Moreover, Wintermute is based on a set of logical abstractions to
ease the configuration of models at a large scale and maximize code re-use. We
highlight Wintermute's flexibility through a series of practical case studies,
each targeting a different aspect of the management of HPC systems, and then
demonstrate the small resource footprint of our implementation.
Comment: Accepted for publication at the 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020).
From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB
Today's HPC installations are highly-complex systems, and their complexity
will only increase as we move to exascale and beyond. At each layer, from
facilities to systems, from runtimes to applications, a wide range of tuning
decisions must be made in order to achieve efficient operation. This, however,
requires systematic and continuous monitoring of system and user data. While
many insular solutions exist, a system for holistic and facility-wide
monitoring is still lacking in the current HPC ecosystem. In this paper we
introduce DCDB, a comprehensive monitoring system capable of integrating data
from all system levels. It is designed as a modular and highly-scalable
framework based on a plugin infrastructure. All monitored data is aggregated at
a distributed noSQL data store for analysis and cross-system correlation. We
demonstrate the performance and scalability of DCDB, and describe two use cases
in the area of energy management and characterization.
Comment: Accepted at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019.
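The plugin pattern at the heart of the modular design described above can be sketched in a few lines: each plugin yields timestamped (sensor, value) readings, and a collector aggregates them into a store. The plugin names, sensor IDs, and in-memory dict standing in for the distributed noSQL store are all hypothetical:

```python
# Sketch of a plugin-based monitoring collector: plugins register
# themselves and yield (sensor_id, value) pairs; one pass pulls them
# all and timestamps the readings. (The real system's distributed
# noSQL store is replaced here by a plain dict.)
import time

PLUGINS = {}

def plugin(name):
    """Decorator registering a callable that yields (sensor_id, value)."""
    def wrap(fn):
        PLUGINS[name] = fn
        return fn
    return wrap

@plugin("cpu")                      # hypothetical system-level plugin
def cpu_readings():
    yield ("node0.cpu0.temp", 54.0)
    yield ("node0.cpu0.power", 85.0)

@plugin("facility")                 # hypothetical facility-level plugin
def facility_readings():
    yield ("room.inlet.temp", 21.5)

def collect(store):
    """One monitoring pass: pull every plugin, timestamp, and insert."""
    ts = time.time()
    for fn in PLUGINS.values():
        for sensor, value in fn():
            store.setdefault(sensor, []).append((ts, value))

store = {}
collect(store)
```

Because facility and system sensors land in the same keyed store with shared timestamps, cross-level correlation (e.g. room inlet temperature vs. CPU power) becomes a simple query, which is the "holistic" property the abstract emphasizes.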
pAElla: Edge-AI based Real-Time Malware Detection in Data Centers
The increasing use of Internet-of-Things (IoT) devices for monitoring a wide
spectrum of applications, along with the challenges of "big data" streaming
support they often require for data analysis, is nowadays pushing for an
increased attention to the emerging edge computing paradigm. In particular,
smart approaches to manage and analyze data directly on the network edge, are
more and more investigated, and Artificial Intelligence (AI) powered edge
computing is envisaged to be a promising direction. In this paper, we focus on
Data Centers (DCs) and Supercomputers (SCs), where a new generation of
high-resolution monitoring systems is being deployed, opening new opportunities
for analysis like anomaly detection and security, but introducing new
challenges for handling the vast amount of data it produces. In detail, we
report on a novel lightweight and scalable approach to increase the security of
DCs/SCs, that involves AI-powered edge computing on high-resolution power
consumption. The method -- called pAElla -- targets real-time Malware Detection
(MD), it runs on an out-of-band IoT-based monitoring system for DCs/SCs, and
involves Power Spectral Density of power measurements, along with AutoEncoders.
Results are promising, with an F1-score close to 1, and a False Alarm and
Malware Miss rate close to 0%. We compare our method with State-of-the-Art MD
techniques and show that, in the context of DCs/SCs, pAElla can cover a wider
range of malware, significantly outperforming SoA approaches in terms of
accuracy. Moreover, we propose a methodology for online training suitable for
DCs/SCs in production, and we release an open dataset and code.
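The detection idea above, pairing the Power Spectral Density of power measurements with a learned model of normal behavior, can be sketched with the stdlib: compute a periodogram of a power trace and flag it when its distance from a reference "healthy" spectrum grows. A fixed reference profile stands in for the paper's autoencoder reconstruction, and the signals are synthetic:

```python
# PSD-based anomaly scoring of a power trace (illustrative signals;
# a stored reference spectrum replaces the paper's autoencoder).
import cmath
import math

def periodogram(x):
    """Naive DFT-based PSD estimate of a real signal."""
    n = len(x)
    psd = []
    for k in range(n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        psd.append(abs(s) ** 2 / n)
    return psd

def anomaly_score(trace, reference):
    """Euclidean distance from the reference 'healthy' PSD."""
    p = periodogram(trace)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, reference)))

n = 64
# Same mean power, but the 'infected' node shows extra periodic
# activity at a different frequency -- invisible in the time average,
# obvious in the spectrum.
healthy = [10 + math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
infected = [10 + math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
reference = periodogram(healthy)
```

The point of working in the frequency domain is visible here: both traces have identical mean power, so a simple average would miss the malware-like activity, while the spectral distance separates them cleanly. A production system would use an FFT and a trained autoencoder rather than this O(n^2) DFT and fixed reference.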