Search CORE

5 research outputs found

An online failure prediction system for private IaaS platforms

Author: Capelastegui de la Concha Pedro
Dueñas López Juan Carlos
Garcia Carmona Rodrigo
Huertas Ferrer Francisco
Navas Baltasar Alvaro
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

The size and complexity of cloud environments make them prone to failures. The traditional approach to achieving high dependability for these systems relies on constant monitoring. However, this method is purely reactive. A more proactive approach is provided by online failure prediction (OFP) techniques. In this paper, we describe an OFP system for private IaaS platforms, currently under development, that combines different types of data input, including monitoring information, event logs, and failure data. In addition, this system operates at both the physical and virtual planes of the cloud, taking into account the relationships between nodes and failure propagation mechanisms that are unique to cloud environments.

Crossref

Archivo Digital UPM

System failure prediction through rare-events elastic-net logistic regression

Author: Dueñas López Juan Carlos
Navarro González José Manuel
Parada Gélvez Hugo Alexer
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014

Predicting failures in a distributed system from previous events through logistic regression is a standard approach in the literature. This technique is not reliable, though, in two situations: in the prediction of rare events, which do not appear in sufficient proportion for the algorithm to capture, and in environments with too many variables, where logistic regression tends to overfit and manually selecting a subset of variables to build the model is error-prone. In this paper, we solve an industrial research case that presented this situation by combining elastic net logistic regression, a method that automatically selects useful variables; a cross-validation process on top of it; and a rare-events prediction technique that reduces computation time. This process provides two layers of cross-validation that automatically obtain the optimal model complexity and the optimal model parameter values, while ensuring that even rare events are correctly predicted with a small number of training instances. We tested this method against real industrial data, obtaining a total of 60 out of 80 possible models with a 90% average model accuracy.

Archivo Digital UPM
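
As a rough illustration of the kind of method the abstract above summarizes, the following is a minimal sketch of elastic-net-penalized logistic regression fitted by proximal gradient descent, plus an intercept correction for models trained on a sample in which rare events were over-represented. It is not the authors' implementation. The abstract does not name its rare-events technique, so the prior correction in the style of King and Zeng is an assumption here, and all function names and parameters (`lam`, `alpha`, `lr`, `epochs`) are hypothetical. Choosing `lam` and `alpha` by cross-validation, as the abstract describes, is left out.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_elastic_net_logistic(X, y, lam=0.1, alpha=0.5, lr=0.1, epochs=500):
    """Minimize mean log-loss + lam * (alpha * |w|_1 + (1 - alpha) / 2 * |w|_2^2).

    X is a list of feature rows, y a list of 0/1 labels.
    The intercept b is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        for j in range(d):
            # Gradient step on the smooth part (log-loss + L2 term) ...
            step = w[j] - lr * (grad_w[j] + lam * (1 - alpha) * w[j])
            # ... followed by the L1 proximal step, which can set w[j] exactly to 0.
            w[j] = soft_threshold(step, lr * lam * alpha)
    return w, b


def prior_corrected_intercept(b, sample_rate, population_rate):
    # Assumed rare-events correction (King and Zeng style): adjust an intercept
    # fitted on a sample with event fraction sample_rate so that it reflects
    # the true event fraction population_rate.
    odds_ratio = ((1 - population_rate) / population_rate) * (
        sample_rate / (1 - sample_rate)
    )
    return b - math.log(odds_ratio)
```

In this sketch the L1 part of the penalty is what performs variable selection: weights whose gradient never exceeds `lam * alpha` stay at exactly zero. A model fitted on a balanced subsample could then be adjusted with `prior_corrected_intercept(b, 0.5, 0.01)` if failures make up about 1% of the real data.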

Classification in sparse, high dimensional environments applied to distributed systems failure prediction

Author: A.S. Tanenbaum
B. Schroeder
F. Salfner
G. King
H. Zou
M. Gallet
N. Trendafilov
W. Ahmed
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Network failures are still one of the main causes of distributed systems’ lack of reliability. To overcome this problem, we present an improvement over a failure prediction system, based on elastic net logistic regression and the application of rare-events prediction techniques, that is able to work with sparse, high-dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter, and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when trying to take a proactive approach to failure management.

Crossref

Archivo Digital UPM

Experimental Verification of Failure Prediction Using Big Data (ビッグデータを活用した障害予測に関する実験的検証)

Author: 伊藤利佳
藤田直行
Publication venue: Mejiro University (目白大学)
Publication date: 29/02/2020

The large-scale expansion of computing environments in recent years has increased the impact of hardware system failures year by year, so systems that predict hardware failures in advance are in demand. When a hardware failure occurs at a research institute or company, administrators are pressed with tasks such as investigating the cause and carrying out recovery, and the failure disrupts the services that computer users rely on. However, the causes of failures are diverse, and it is difficult to detect failure symptoms only by monitoring internal hardware information such as performance degradation and traffic conditions. We are therefore conducting research on detecting failure symptoms by comprehensively examining both the internal information of computers and external information such as the conditions in which systems are installed. Collecting internal information that captures the state of failed hardware is not easy, however. In this study, we therefore apply machine learning to the S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data provided by hard disks, analyze the prediction of hard disk failures, and report the results.

Mejiro University Repository / 目白大学リポジトリ

Data Driven Device Failure Prediction

Author: Jordan Paul L.
Publication venue: AFIT Scholar
Publication date: 15/09/2016

As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensuring those systems do not fail also becomes more important. Many organizations depend heavily on desktop computers for day-to-day operations. Unfortunately, the software that runs on these computers is still written by humans and, as such, is still subject to human error and consequent failure. A natural solution is to use statistical machine learning to predict failure. However, since failure is still a relatively rare event, obtaining labeled training data to train these models is not trivial. This work presents new simulated fault loads with an automated framework to predict failure in the Microsoft enterprise authentication service and Apache web server in an effort to increase uptime and improve mission effectiveness. These new fault loads were successful in creating realistic failure conditions that are accurately identified by statistical learning models.

AFIT Scholar (Air Force Institute of Technology)