Search CORE

721 research outputs found

Contextual linking between workflow provenance and system performance logs

Author: El Khaldi Ahanach E.
Koulouzis S.
Zhao Z.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

International Migration, Integration and Social Cohesion online publications

Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

Author: Atkinson Malcolm P.
Deelman Ewa
Filgueira Rosa
Overton Ian M.
Pairo-Castineira Erola
Silva Rafael Ferreira da
Publication venue: 'Elsevier BV'
Publication date: 01/06/2019
Field of study

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge from both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen, and mitigated them when possible. The proposed approach is inspired on the proportional–integral–derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, where the controller will react to adjust its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large scale data-intensive workflows—data storage overload and memory overflow. We developed a simulator, which implements and evaluates simple standalone PID-inspired controllers to autonomously manage data and memory usage of a data-intensive bioinformatics workflow that consumes/produces over 4.4 TB of data, and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence

Queen's University Belfast Research Portal

Heriot Watt Pure

Edinburgh Research Explorer

University of St. Andrews - Pure

NERC Open Research Archive

Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

Author: Bhanage Deepali Arun
Kotecha K
Pawar Ambika Vishal
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 04/03/2021
Field of study

Maintaining the health of IT infrastructure components for improved reliability and availability is a research and innovation topic for many years. Identification and handling of failures are crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information to diagnose and fix failures. In this work, we address three essential research dimensions about failures, such as the need for failure handling in IT infrastructure, understanding the contribution of system-generated log in failure detection and reactive & proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of existing literature by considering three prominent aspects as log preprocessing, anomaly & failure detection, and failure prevention. With this coherent review, we (1) presume the need for IT infrastructure monitoring to avoid downtime, (2) examine the three types of approaches for anomaly and failure detection such as a rule-based, correlation method and classification, and (3) fabricate the recommendations for researchers on further research guidelines. As far as the authors\u27 knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of meta-analysis and comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection. This work will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem

DigitalCommons@University of Nebraska

Online Self-Healing Control Loop to Prevent and Mitigate Faults in Scientific Workflows

Author: Atkinson Malcolm
Deelman Ewa
Ferreira da Silva Rafael
Filgueira Rosa
Overton Ian
Pairo-Castineira Erola
Publication venue
Publication date: 18/11/2016
Field of study

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflow in distributed systems is failure prediction, detection, and recovery. In this paper, we present a novel online self-healing framework, where failures are predicted before they happen, and are mitigated when possible. The proposed approach is to use control theory developed as part of autonomic computing, and in particular apply the proportional-integral-derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the mechanism. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era—data footprint and memory usage. We define, implement, and evaluate PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of data, and requires over 24TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed control loop, and faults are detected and mitigated far in advance

Heriot Watt Pure

Edinburgh Research Explorer

University of St. Andrews - Pure

Monitoring and Optimization of ATLAS Tier 2 Center GoeGrid

Author: Magradze Erekle
Publication venue: University Goettingen Repository
Publication date: 11/01/2016
Field of study

The demand on computational and storage resources is growing along with the amount of infor- mation that needs to be processed and preserved. In order to ease the provisioning of the digital services to the growing number of consumers, more and more distributed computing systems and platforms are actively developed and employed. The building block of the distributed computing infrastructure are single computing centers, similar to the Worldwide LHC Computing Grid, Tier 2 centre GoeGrid. The main motivation of this thesis was the optimization of GoeGrid perfor- mance by efficient monitoring. The goal has been achieved by means of the GoeGrid monitoring information analysis. The data analysis approach was based on the adaptive-network-based fuzzy inference system (ANFIS) and machine learning algorithm such as Linear Support Vector Machine (SVM). The main object of the research was the digital service, since availability, reliability and ser- viceability of the computing platform can be measured according to the constant and stable provisioning of the services. Due to the widely used concept of the service oriented architecture (SOA) for large computing facilities, in advance knowing of the service state as well as the quick and accurate detection of its disability allows to perform the proactive management of the com- puting facility. The proactive management is considered as a core component of the computing facility management automation concept, such as Autonomic Computing. Thus in time as well as in advance and accurate identification of the provided service status can be considered as a contribution to the computing facility management automation, which is directly related to the provisioning of the stable and reliable computing resources. Based on the case studies, performed using the GoeGrid monitoring data, consideration of the approaches as generalized methods for the accurate and fast identification and prediction of the service status is reasonable. Simplicity and low consumption of the computing resources allow to consider the methods in the scope of the Autonomic Computing component

Georg-August-University Göttingen

CERN Document Server

AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

Author: Cheng Qian
Hoi Steven C. H.
Liu Chenghao
Saha Amrita
Sahoo Doyen
Saverese Silvio
Singh Manpreet
Woo Gerald
Yang Wenzhuo
Publication venue
Publication date: 10/04/2023
Field of study

Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively under explored topics, especially those that could significantly benefit from advances in AI literature. We also provide insights into the trends in this field, and what are the key investment opportunities

arXiv.org e-Print Archive

Requirements of the SALTY project

Author: Abchir Mohammed-Amine
Bathias Thierry
Blay-Fornarino Mireille
Collet Philippe
Krikava Filip
Lenoir Julien
Lesbegueries Julien
Madelénat Sébastien
Malenfant Jacques
Manset David
Melekhova Olga
Mollon Rémi
Montagnat Johan
Nzekwa Russel
Pappa Anna
Revillard Jérôme
Rouvoy Romain
Seinturier Lionel
Truck Isis
Publication venue: HAL CCSD
Publication date: 02/08/2010
Field of study

This document is the first external deliverable of the SALTY project (Self-Adaptive very Large disTributed sYstems), funded by the ANR under contract ANR-09-SEGI-012. It is the result of task 1.1 of the Work Package (WP) 1 : Requirements and Architecture. Its objective is to identify and collect requirements from use cases that are going to be developed in WP 4 (Use cases and Validation). Based on the study and classification of the use cases, requirements against the envisaged framework are then determined and organized in features. These features will aim at guide and control the advances in all work packages of the project. As a start, features are classified, briefly described and related scenarios in the defined use cases are pinpointed. In the following tasks and deliverables, these features will facilitate design by assigning priorities to them and defining success criteria at a finer grain as the project progresses. This report, as the first external document, has no dependency to any other external documents and serves as a reference to future external documents. As it has been built from the use cases studies that have been synthesized in two internal documents of the project, extracts from the two documents are made available as appendices (cf. appen- dices B and C)

HAL - Lille 3

HAL-UNICE

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL - UPEC / UPEM

Optimization and Prediction Techniques for Self-Healing and Self-Learning Applications in a Trustworthy Cloud Continuum

Author: Alonso Juncal
Diaz de Arcaya Josu
Etxaniz Iñaki
López Lobo Jesús
Martinez Iñigo
Orue-Echevarria Leire
Osaba Eneko
Publication venue: 'MDPI AG'
Publication date: 01/07/2021
Field of study

The current IT market is more and more dominated by the “cloud continuum”. In the “traditional” cloud, computing resources are typically homogeneous in order to facilitate economies of scale. In contrast, in edge computing, computational resources are widely diverse, commonly with scarce capacities and must be managed very efficiently due to battery constraints or other limitations. A combination of resources and services at the edge (edge computing), in the core (cloud computing), and along the data path (fog computing) is needed through a trusted cloud continuum. This requires novel solutions for the creation, optimization, management, and automatic operation of such infrastructure through new approaches such as infrastructure as code (IaC). In this paper, we analyze how artificial intelligence (AI)-based techniques and tools can enhance the operation of complex applications to support the broad and multi-stage heterogeneity of the infrastructural layer in the “computing continuum” through the enhancement of IaC optimization, IaC self-learning, and IaC self-healing. To this extent, the presented work proposes a set of tools, methods, and techniques for applications’ operators to seamlessly select, combine, configure, and adapt computation resources all along the data path and support the complete service lifecycle covering: (1) optimized distributed application deployment over heterogeneous computing resources; (2) monitoring of execution platforms in real time including continuous control and trust of the infrastructural services; (3) application deployment and adaptation while optimizing the execution; and (4) application self-recovery to avoid compromising situations that may lead to an unexpected failure.This research was funded by the European project PIACERE (Horizon 2020 research and innovation Program, under grant agreement no 101000162)

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

TECNALIA Publications

Challenges in Cybersecurity and Privacy - the European Research Landscape

Author
Publication venue: 'Informa UK Limited'
Publication date: 28/11/2022
Field of study

Cybersecurity and Privacy issues are becoming an important barrier for a trusted and dependable global digital society development. Cyber-criminals are continuously shifting their cyber-attacks specially against cyber-physical systems and IoT, since they present additional vulnerabilities due to their constrained capabilities, their unattended nature and the usage of potential untrustworthiness components. Likewise, identity-theft, fraud, personal data leakages, and other related cyber-crimes are continuously evolving, causing important damages and privacy problems for European citizens in both virtual and physical scenarios. In this context, new holistic approaches, methodologies, techniques and tools are needed to cope with those issues, and mitigate cyberattacks, by employing novel cyber-situational awareness frameworks, risk analysis and modeling, threat intelligent systems, cyber-threat information sharing methods, advanced big-data analysis techniques as well as exploiting the benefits from latest technologies such as SDN/NFV and Cloud systems. In addition, novel privacy-preserving techniques, and crypto-privacy mechanisms, identity and eID management systems, trust services, and recommendations are needed to protect citizens’ privacy while keeping usability levels. The European Commission is addressing the challenge through different means, including the Horizon 2020 Research and Innovation program, thereby financing innovative projects that can cope with the increasing cyberthreat landscape. This book introduces several cybersecurity and privacy research challenges and how they are being addressed in the scope of 15 European research projects. Each chapter is dedicated to a different funded European Research project, which aims to cope with digital security and privacy aspects, risks, threats and cybersecurity issues from a different perspective. Each chapter includes the project’s overviews and objectives, the particular challenges they are covering, research achievements on security and privacy, as well as the techniques, outcomes, and evaluations accomplished in the scope of the EU project. The book is the result of a collaborative effort among relative ongoing European Research projects in the field of privacy and security as well as related cybersecurity fields, and it is intended to explain how these projects meet the main cybersecurity and privacy challenges faced in Europe. Namely, the EU projects analyzed in the book are: ANASTACIA, SAINT, YAKSHA, FORTIKA, CYBECO, SISSDEN, CIPSEC, CS-AWARE. RED-Alert, Truessec.eu. ARIES, LIGHTest, CREDENTIAL, FutureTrust, LEPS. Challenges in Cybersecurity and Privacy - the European Research Landscape is ideal for personnel in computer/communication industries as well as academic staff and master/research students in computer science and communications networks interested in learning about cyber-security and privacy aspects

Directory of Open Access Books (DOAB)

Challenges in Cybersecurity and Privacy - the European Research Landscape

Author
Publication venue: 'Informa UK Limited'
Publication date
Field of study

OAPEN Library