Search CORE

2,334 research outputs found

Fault-tolerant behavior in state-of-the-art grid workflow management systems

Author: Fahringer T.
Fahringer T.
Kacsuk P.
Kacsuk P.
Kertesz A.
Kertesz A.
Plankensteiner K.
Plankensteiner K.
Prodan R.
Prodan R.
Publication venue: CoreGRID
Publication date: 18/10/2007
Field of study

Considering Human Aspects on Strategies for Designing and Managing Distributed Human Computation

Author: Andrade Nazareno
Brasileiro Francisco
Ponciano Lesandro
Sampaio Lívia
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/06/2015
Field of study

A human computation system can be viewed as a distributed system in which the processors are humans, called workers. Such systems harness the cognitive power of a group of workers connected to the Internet to execute relatively simple tasks, whose solutions, once grouped, solve a problem that systems equipped with only machines could not solve satisfactorily. Examples of such systems are Amazon Mechanical Turk and the Zooniverse platform. A human computation application comprises a group of tasks, each of them can be performed by one worker. Tasks might have dependencies among each other. In this study, we propose a theoretical framework to analyze such type of application from a distributed systems point of view. Our framework is established on three dimensions that represent different perspectives in which human computation applications can be approached: quality-of-service requirements, design and management strategies, and human aspects. By using this framework, we review human computation in the perspective of programmers seeking to improve the design of human computation applications and managers seeking to increase the effectiveness of human computation infrastructures in running such applications. In doing so, besides integrating and organizing what has been done in this direction, we also put into perspective the fact that the human aspects of the workers in such systems introduce new challenges in terms of, for example, task assignment, dependency management, and fault prevention and tolerance. We discuss how they are related to distributed systems and other areas of knowledge.Comment: 3 figures, 1 tabl

arXiv.org e-Print Archive

Springer - Publisher Connector

Perfomance Analysis and Resource Optimisation of Critical Systems Modelled by Petri Nets

Author: Júlvez Bueno Jorge Emilio
Merseguer Hernáiz José Javier
Rodríguez Fernández Ricardo Julio
Publication venue: Universidad de Zaragoza, Prensas de la Universidad
Publication date: 01/01/2013
Field of study

Un sistema crítico debe cumplir con su misión a pesar de la presencia de problemas de seguridad. Este tipo de sistemas se suele desplegar en entornos heterogéneos, donde pueden ser objeto de intentos de intrusión, robo de información confidencial u otro tipo de ataques. Los sistemas, en general, tienen que ser rediseñados después de que ocurra un incidente de seguridad, lo que puede conducir a consecuencias graves, como el enorme costo de reimplementar o reprogramar todo el sistema, así como las posibles pérdidas económicas. Así, la seguridad ha de ser concebida como una parte integral del desarrollo de sistemas y como una necesidad singular de lo que el sistema debe realizar (es decir, un requisito no funcional del sistema). Así pues, al diseñar sistemas críticos es fundamental estudiar los ataques que se pueden producir y planificar cómo reaccionar frente a ellos, con el fin de mantener el cumplimiento de requerimientos funcionales y no funcionales del sistema. A pesar de que los problemas de seguridad se consideren, también es necesario tener en cuenta los costes incurridos para garantizar un determinado nivel de seguridad en sistemas críticos. De hecho, los costes de seguridad puede ser un factor muy relevante ya que puede abarcar diferentes dimensiones, como el presupuesto, el rendimiento y la fiabilidad. Muchos de estos sistemas críticos que incorporan técnicas de tolerancia a fallos (sistemas FT) para hacer frente a las cuestiones de seguridad son sistemas complejos, que utilizan recursos que pueden estar comprometidos (es decir, pueden fallar) por la activación de los fallos y/o errores provocados por posibles ataques. Estos sistemas pueden ser modelados como sistemas de eventos discretos donde los recursos son compartidos, también llamados sistemas de asignación de recursos. Esta tesis se centra en los sistemas FT con recursos compartidos modelados mediante redes de Petri (Petri nets, PN). Estos sistemas son generalmente tan grandes que el cálculo exacto de su rendimiento se convierte en una tarea de cálculo muy compleja, debido al problema de la explosión del espacio de estados. Como resultado de ello, una tarea que requiere una exploración exhaustiva en el espacio de estados es incomputable (en un plazo prudencial) para sistemas grandes. Las principales aportaciones de esta tesis son tres. Primero, se ofrecen diferentes modelos, usando el Lenguaje Unificado de Modelado (Unified Modelling Language, UML) y las redes de Petri, que ayudan a incorporar las cuestiones de seguridad y tolerancia a fallos en primer plano durante la fase de diseño de los sistemas, permitiendo así, por ejemplo, el análisis del compromiso entre seguridad y rendimiento. En segundo lugar, se proporcionan varios algoritmos para calcular el rendimiento (también bajo condiciones de fallo) mediante el cálculo de cotas de rendimiento superiores, evitando así el problema de la explosión del espacio de estados. Por último, se proporcionan algoritmos para calcular cómo compensar la degradación de rendimiento que se produce ante una situación inesperada en un sistema con tolerancia a fallos

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Universidad de Zaragoza

Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

Author: Bhanage Deepali Arun
Kotecha K
Pawar Ambika Vishal
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 04/03/2021
Field of study

Maintaining the health of IT infrastructure components for improved reliability and availability is a research and innovation topic for many years. Identification and handling of failures are crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information to diagnose and fix failures. In this work, we address three essential research dimensions about failures, such as the need for failure handling in IT infrastructure, understanding the contribution of system-generated log in failure detection and reactive & proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of existing literature by considering three prominent aspects as log preprocessing, anomaly & failure detection, and failure prevention. With this coherent review, we (1) presume the need for IT infrastructure monitoring to avoid downtime, (2) examine the three types of approaches for anomaly and failure detection such as a rule-based, correlation method and classification, and (3) fabricate the recommendations for researchers on further research guidelines. As far as the authors\u27 knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of meta-analysis and comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection. This work will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem

DigitalCommons@University of Nebraska

DCDIDP: A distributed, collaborative, and data-driven intrusion detection and prevention framework for cloud computing environments

Author: Joshi JBD
Takabi H
Zargar ST
Publication venue: 'European Alliance for Innovation n.o.'
Publication date: 01/01/2011
Field of study

With the growing popularity of cloud computing, the exploitation of possible vulnerabilities grows at the same pace; the distributed nature of the cloud makes it an attractive target for potential intruders. Despite security issues delaying its adoption, cloud computing has already become an unstoppable force; thus, security mechanisms to ensure its secure adoption are an immediate need. Here, we focus on intrusion detection and prevention systems (IDPSs) to defend against the intruders. In this paper, we propose a Distributed, Collaborative, and Data-driven Intrusion Detection and Prevention system (DCDIDP). Its goal is to make use of the resources in the cloud and provide a holistic IDPS for all cloud service providers which collaborate with other peers in a distributed manner at different architectural levels to respond to attacks. We present the DCDIDP framework, whose infrastructure level is composed of three logical layers: network, host, and global as well as platform and software levels. Then, we review its components and discuss some existing approaches to be used for the modules in our proposed framework. Furthermore, we discuss developing a comprehensive trust management framework to support the establishment and evolution of trust among different cloud service providers. © 2011 ICST

Crossref

D-Scholarship@Pitt

Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

Author: Atkinson Malcolm P.
Deelman Ewa
Filgueira Rosa
Overton Ian M.
Pairo-Castineira Erola
Silva Rafael Ferreira da
Publication venue: 'Elsevier BV'
Publication date: 01/06/2019
Field of study

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge from both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen, and mitigated them when possible. The proposed approach is inspired on the proportional–integral–derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, where the controller will react to adjust its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large scale data-intensive workflows—data storage overload and memory overflow. We developed a simulator, which implements and evaluates simple standalone PID-inspired controllers to autonomously manage data and memory usage of a data-intensive bioinformatics workflow that consumes/produces over 4.4 TB of data, and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence

Queen's University Belfast Research Portal

Heriot Watt Pure

Edinburgh Research Explorer

University of St. Andrews - Pure

NERC Open Research Archive

FATMAS: a methodology to design fault-tolerant multi-agent systems

Author: Mellouli Sehl
Publication venue: Bibliotheque de l' Universite Laval
Publication date: 01/01/2005
Field of study

Un système multi-agent (SMA) est un système dans lequel plusieurs agents opèrent et interagissent. Chaque agent a la responsabilité d’exécuter des tâches. Cependant, chaque agent, pour diverses raisons, peut rencontrer des problèmes pendant l’exécution de ses tâches ; ce qui peut induire un disfonctionnement du SMA. Cependant, le SMA doit être en mesure de détecter les sources de problèms (d’erreurs) afin de les contrôler et ainsi continuer son exécution correctement. Un tel SMA est appelé un SMA tolérant aux fautes. Il existe deux types de sources d’erreurs pour un agent : les erreurs causées par son environnment et les erreurs dûes à sa programmation. Dans la littérature, il existe plusieurs techniques qui traitent des erreurs de programmation au niveau des agents. Cependant, ces techniques ne traitent pas des erreurs causées par l’environnement de l’agent. Tout d’abord, nous distinguons entre l’environnment d’un agent et l’environnement du SMA. L’environnement d’un agent représente toutes les composantes matérielles ou logicielles que l’agent ne peut contrôler mais avec lesquelles il interagit. Cependant, l’environnment du SMA représente toutes les composantes que le système ne contrôle pas mais avec lesquelles il interagit. Ainsi, le SMA peut contrôler certaines des composantes avec lesquelles un agent interagit. Ainsi, une composante peut appartenir à l’environnement d’un agent et ne pas appartenir à l’environnement du système. Dans ce travail, nous présentons une méthodologie de conception de SMA tolérants aux fautes, nommée FATMAS, qui permet au concepteur du SMA de détecter et de corriger, si possible, les erreurs causées par les environnements des agents. Cette méthodologie permettra ainsi de délimiter la frontière du SMA de son environnement avec lequel il interagit. La frontière du SMA est déterminée par les différentes composantes (matérielles ou logicielles) que le système contrôle. Ainsi, le SMA, à l’intérieur de sa frontière, peut corriger les erreurs provenant de ses composantes. Cependant, le SMA n’a aucun contrôle sur toutes les composantes opérant dans son environnement. La méthodologie, que nous proposons, doit couvrir les trois premières phases d’un développement logiciel qui sont l’analyse, la conception et l’implémentation tout en intégrant, dans son processus de développement, une technique permettant au concepteur du système de délimiter la frontière du SMA et ainsi détecter les sources d’erreurs et les contrôler afin que le système multi-agent soit tolérant aux fautes (SMATF). Cependant, les méthodologies de conception de SMA, référencées dans la littérature, n’intègrent pas une telle technique. FATMAS offre au concepteur du SMATF quatre modèles pour décrire et développer le SMA ainsi qu’une technique de réorganisation du système qui lui permet de détecter et de contrôler ses sources d’erreurs, et ainsi définir la frontière du SMA. Chaque modèle est associé à un micro processus qui guide le concepteur lors du développement du modèle. FATMAS offre aussi un macro-processus, qui définit le cycle de développement de la méthodologie. FATMAS se base sur un développement itératif pour identifier et déterminer les tâches à ajouter au système afin de contrôler des sources d’erreurs. À chaque itération, le concepteur évalue, selon une fonction de coût/bénéfice s’il est opportun d’ajouter de nouvelles tâches de contrôle au système. Le premier modèle est le modèle de tâches-environnement. Il est développé lors de la phase d’analyse. Il identifie les différentes tâches que les agents doivent exécuter, leurs préconditions et leurs ressources. Ce modèle permet d’identifier différentes sources de problèmes qui peuvent causer un disfonctionnement du système. Le deuxième modèle est le modèle d’agents. Il est développé lors de la phase de conception. Il décrit les agents, leurs relations, et spécifie pour chaque agent les ressources auxquelles il a le droit d’accéder. Chaque agent exécutera un ensemble de tâches identifiées dans le modèle de tâches-environnement. Le troisième modèle est le modèle d’interaction d’agents. Il est développé lors de la phase de conception. Il décrit les échanges de messages entre les agents. Le quatrième modèle est le modèle d’implémentation. Il est développé lors de la phase d’implémentation. Il décrit l’infrastructure matérielle sur laquelle le SMA va opérer ainsi que l’environnement de développement du SMA. La méthodologie inclut aussi une technique de réorganisation. Cette technique permet de délimiter la frontière du SMA et contrôler, si possible, ses sources d’erreurs. Cette technique doit intégrer trois techniques nécessaires à la conception d’un système tolérant aux fautes : une technique de prévention d’erreurs, une technique de recouvrement d’erreurs, et une technique de tolérance aux fautes. La technique de prévention d’erreurs permet de délimiter la frontière du SMA. La technique de recouvrement d’erreurs permet de proposer une architecture du SMA pour détecter les erreurs. La technique de tolérance aux fautes permet de définir une procédure de réplication d’agents et de tâches dans le SMA pour que le SMA soit tolérant aux fautes. Cette dernière technique, à l’inverse des techniques de tolérance aux fautes existantes, réplique les tâches et les agents et non seulement les agents. Elle permet ainsi de réduire la complexité du système en diminuant le nombre d’agents à répliquer. Résumé iv De même, un agent peut ne pas être en erreur mais la composante matérielle sur laquelle il est exécuté peut ne plus être fonctionnelle. Ce qui constitue une source d’erreurs pour le SMA. Il faudrait alors que le SMA continue à s’exécuter correctement malgrè le disfonctionnement d’une composante. FATMAS fournit alors un support au concepteur du système pour tenir compte de ce type d’erreurs soit en contrôlant les composantes matérielles, soit en proposant une distribution possible des agents sur les composantes matérielles disponibles pour que le disfonctionnement d’une composante matérielle n’affecte pas le fonctionnement du SMA. FATMAS permet d’identifier des sources d’erreurs lors de la phase de conception du système. Cependant, elle ne traite pas des sources d’erreurs de programmation. Ainsi, la technique de réorganization proposée dans ce travail sera validée par rapport aux sources d’erreurs identifiées lors de la phase de conception et provenant de la frontière du SMA. Nous démontrerons formellement que, si une erreur provient d’une composante que le SMA contrôle, le SMA devrait être opérationnel. Cependant, FATMAS ne certifie pas que le futur système sera toujours opérationnel car elle ne traîte pas des erreurs de programmation ou des erreurs causées par son environnement.A multi-agent system (MAS) consists of several agents interacting together. In a MAS, each agent performs several tasks. However, each agent is prone to individual failures so that it can no longer perform its tasks. This can lead the MAS to a failure. Ideally, the MAS should be able to identify the possible sources of failures and try to overcome them in order to continue operating correctly ; we say that it should be fault-tolerant. There are two kinds of sources of failures to an agent : errors originating from the environment with which the agents interacts, and programming exceptions. There are several works on fault-tolerant systems which deals with programming exceptions. However, these techniques does not allow the MAS to identify errors originating from an agent’s environment. In this thesis, we propose a design methodology, called FATMAS, which allows a MAS designer to identify errors originating from agents’ environments. Doing so, the designer can determine the sources of failures it could be able to control and those it could not. Hence, it can determine the errors it can prevent and those it cannot. Consequently, this allows the designer to determine the system’s boundary from its environment. The system boundary is the area within which the decision-taking process of the MAS has power to make things happen, or prevent them from happening.We distinguish between the system’s environment and an agent’s environment. An agent’s environment is characterized by the components (hardware or software) that the agent does not control. However, the system may control some of the agent’s environment components. Consequently, some of the agent’s environment components may not be a part of the system’s environment. The development of a fault-tolerant MAS (FTMAS) requires the use of a methodology to design FTMAS and of a reorganization technique that will allow the MAS designer to identify and control, if possible, different sources of system failure. However, current MAS design methodologies do not integrate such a technique. FATMAS provides four models used to design and implement the target system and a reorganization technique to assist the designer in identifying and controlling different sources of system’s failures. FATMAS also provides a macro process which covers the entire life cycle of the system development as well as several micro processes that guide the designer when developing each model. The macro-process is based on an iterative approach based on a cost/benefit evaluation to help the designer to decide whether to go from one iteration to another. The methodology has three phases : analysis, design, and implementation. The analysis phase develops the task-environment model. This model identifies the different tasks the agents will perform, their resources, and their preconditions. It identifies several possible sources of system failures. The design phase develops the agent model and the agent interaction model. The agent model describes the agents and their resources. Each agent performs several tasks identified in the task-environment model. The agent interaction model describes the messages exchange between agents. The implementation phase develops the implementation model, and allows an automatic code generation of Java agents. The implementation model describes the infrastructure upon which the MAS will operate and the development environment to be used when developing the MAS. The reorganization technique includes three techniques required to design a fault-tolerant system : a fault-prevention technique, a fault-recovery technique, and a fault-tolerance technique. The fault-prevention technique assists the designer in delimiting the system’s boundary. The fault-recovery technique proposes a MAS architecture allowing it to detect failures. The fault-tolerance technique is based on agent and task redundancy. Contrary to existing fault-tolerance techniques, this technique replicates tasks and agents and not only agents. Thus, it minimizes the system complexity by minimizing the number of agents operating in the system. Furthermore, FATMAS helps the designer to deal with possible physical component failures, on which the MAS will operate. It proposes a way to either control these components or to distribute the agents on these components in such a way that if a component is in failure, then the MAS could continue operating properly. The FATMAS methodology presented in this dissertation assists a designer, in its development process, to build fault-tolerant systems. It has the following main contributions : 1. it allows to identify different sources of system failure ; 2. it proposes to introduce new tasks in a MAS to control the identified sources of failures ; 3. it proposes a mechanism which automatically determines which tasks (agents) should be replicated and in which other agents ; 4. it reduces the system complexity by minimizing the replication of agents ; Abstract vii 5. it proposes a MAS reorganization technique which is embedded within the designed MAS and assists the designer to determine the system’s boundary. It proposes a MAS architecture to detect and recover from failures originating from the system boundary. Moreover, it proposes a way to distribute agents on the physical components so that the MAS could continue operating properly in case of a component failure. This could make the MAS more robust to fault prone environments. FATMAS alows to determine different sources of failures of a MAS. The MAS controls the sources of failures situated in its boundary. It does not control the sources of failures situated in its environments. Consequently, the reorganization technique proposed in this dissertation will be proven valid only in the case where the sources of failures are controlled by the MAS. However, it cannot be proven that the future system is fault-tolerant since faults originating from the environment or from coding are not dealt with

CorpusUL

Digital provenance - models, systems, and applications

Author: Sultana Salmin
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2000
Field of study

Data provenance refers to the history of creation and manipulation of a data object and is being widely used in various application domains including scientific experiments, grid computing, file and storage system, streaming data etc. However, existing provenance systems operate at a single layer of abstraction (workflow/process/OS) at which they record and store provenance whereas the provenance captured from different layers provide the highest benefit when integrated through a unified provenance framework. To build such a framework, a comprehensive provenance model able to represent the provenance of data objects with various semantics and granularity is the first step. In this thesis, we propose a such a comprehensive provenance model and present an abstract schema of the model. ^ We further explore the secure provenance solutions for distributed systems, namely streaming data, wireless sensor networks (WSNs) and virtualized environments. We design a customizable file provenance system with an application to the provenance infrastructure for virtualized environments. The system supports automatic collection and management of file provenance metadata, characterized by our provenance model. Based on the proposed provenance framework, we devise a mechanism for detecting data exfiltration attack in a file system. We then move to the direction of secure provenance communication in streaming environment and propose two secure provenance schemes focusing on WSNs. The basic provenance scheme is extended in order to detect packet dropping adversaries on the data flow path over a period of time. We also consider the issue of attack recovery and present an extensive incident response and prevention system specifically designed for WSNs

Purdue E-Pubs

HAL-Rennes 1