16 research outputs found

A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q

Author: B Javadi
F Falciano
F Salfner
GL Valentini
Publication venue
Publication date: 01/01/2015
Field of study

The complexity and cost of managing high-performance computing infrastructures are on the rise. Automating management and repair through predictive models to minimize human interventions is an attempt to increase system availability and contain these costs. Building predictive models that are accurate enough to be useful in automatic management cannot be based on restricted log data from subsystems but requires a holistic approach to data analysis from disparate sources. Here we provide a detailed multi-scale characterization study based on four datasets reporting power consumption, temperature, workload, and hardware/software events for an IBM Blue Gene/Q installation. We show that the system runs a rich parallel workload, with low correlation among its components in terms of temperature and power, but higher correlation in terms of events. As expected, power and temperature correlate strongly, while events display negative correlations with load and power. Power and workload show moderate correlations, and only at the scale of components. The aim of the study is a systematic, integrated characterization of the computing infrastructure and discovery of correlation sources and levels to serve as a basis for future predictive modeling efforts. Comment: 12 pages, 7 figures

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Complex decision making as a source of infotainment

Author: E.R. Shellman
F. Salfner
H.E. Mansour
K.J. Rothenhaus
Publication venue
Publication date: 03/11/2011
Field of study

In many policy processes nowadays a variety of actors is involved, which results in complex decision making processes, since these different actors have various perspectives on the problem and the matching solutions. Such complex processes are difficult to grasp in short reports in newspapers or on television, especially since journalists have to deal with increasing time pressures and demands to make news items more entertaining. This leads to biases in the construction of the policy processes. In this study we examine whether the biases of fragmentization, dramatization, personalization, the authority-disorder bias and the negativity bias can be found in media reporting on complex decision making processes in the Netherlands. We conducted a quantitative content analysis on media reports on five complex water management projects in the Netherlands. We found that in these media reports stories are often fragmentized, dramatized and unfavourable towards the project, and frequently an authority is blamed for not taking appropriate measures. Certain actors take advantage of these biases more than other actors: media attention for oppositional politicians and interest groups in particular relates significantly to the media biases.

Crossref

Erasmus University Digital Repository

Classification in sparse, high dimensional environments applied to distributed systems failure prediction

Author: A.S. Tanenbaum
B. Schroeder
F. Salfner
G. King
H. Zou
M. Gallet
N. Trendafilov
W. Ahmed
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

Network failures are still one of the main causes of distributed systems’ lack of reliability. To overcome this problem we present an improvement over a failure prediction system, based on Elastic Net Logistic Regression and the application of rare events prediction techniques, able to work with sparse, high dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when trying to take a proactive approach to failure management.

Crossref

Archivo Digital UPM (Univ. Politécnica de Madrid)

Unveiling clusters of events for alert and incident management in large-scale enterprise IT

Author: Langville A. N.
P.
Salfner F.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date
Field of study

Crossref

On the Optimum Checkpointing Interval Selection for Variable Size Checkpoint Dumps

Author: E. Gelenbe
F. Salfner
I.P. Egwutuoha
J.T. Daly
J.W. Young
L. Zhu
M.-S. Bouguerra
S. Toueg
T. Ozaki
Y. Ling
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

Crossref

CEP4CMA: Multi-layer Cloud Performance Monitoring and Analysis via Complex Event Processing

Author: DC Crocker
F Faul
F Salfner
G Cugola
KL Nance
ML Massie
R Taylor
S Bhaumik
SA Chaves De
W Hagen von
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

Towards operator-less data centers through data-driven, predictive, proactive autonomics

Author: A Rosà
AK Mishra
Alina Sîrbu
B Javadi
F Salfner
J Tigani
L Rokach
M Galar
Ozalp Babaoglu
Q Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

The Failure Prediction of Cluster Systems Based on System Logs

Author: A. Pecchia
E.W. Fulp
F. Salfener
F. Salfner
G. Jiexing
I. Fronza
L. Yinglung
M. Joshi
P. Gujrati
Q. Guan
R.K. Sahoo
W. Wenjian
X. Fu
X. Zhenghua
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

Crossref

Failure prediction for HPC systems and applications

Author: Aupy G
Bolander N
Chen MY
DiMartino C
Farr W
Gertsbakh I
Gu J
Nassar FA
Rajachandrasekar R
Salfner F
Stearley J
Wang C
Yu L
Zheng GB
Publication venue: 'SAGE Publications'
Publication date
Field of study

Crossref

Explainable Deep Learning for Fault Prognostics in Complex Systems: A Particle Accelerator Use-Case

Author: B Zhao
C Bergmeir
EW Fulp
F Pedregosa
F Salfner
G Montavon
H Ismail Fawaz
I Fronza
J Mori
K Leahy
M Bach-Andersen
M Eichler
S Bach
S Khan
T Hastie
W Samek
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 25/08/2020
Field of study

Sophisticated infrastructures often exhibit misbehaviour and failures resulting from complex interactions of their constituent subsystems. Such infrastructures use alarms, event and fault information, which is recorded to help diagnose and repair failure conditions by operations experts. This data can be analysed using explainable artificial intelligence to attempt to reveal precursors and eventual root causes. The proposed method is first applied to synthetic data in order to prove functionality. With synthetic data the framework makes extremely precise predictions and root causes can be identified correctly. Subsequently, the method is applied to real data from a complex particle accelerator system. In the real data setting, deep learning models produce accurate predictive models from fewer than ten error examples when precursors are captured. The approach described herein is a potentially valuable tool for operations experts to identify precursors in complex infrastructures.

Crossref

HAL Descartes

CERN Document Server