
    Reliable Messaging to Millions of Users with MigratoryData

    Web-based notification services are used by a wide range of businesses to selectively distribute live updates to customers, following the publish/subscribe (pub/sub) model. Typical deployments can involve millions of subscribers expecting ordering and delivery guarantees together with low latencies. Notification services must be vertically and horizontally scalable, and adopt replication to provide a reliable service. We report our experience building and operating MigratoryData, a highly scalable notification service. We discuss the typical requirements of MigratoryData customers, and describe the architecture and design of the service, focusing on scalability and fault tolerance. Our evaluation demonstrates the ability of MigratoryData to handle millions of concurrent connections and support a reliable notification service despite server failures and network disconnections.
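
    The pub/sub model the paper builds on can be illustrated with a minimal in-process broker; the Broker class, subject names and callbacks below are illustrative assumptions for a sketch, not MigratoryData's actual client API.

```python
# Minimal in-process publish/subscribe sketch of the model described above;
# class and topic names are illustrative, not MigratoryData's client API.
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """Routes messages published on a subject to all of its subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, subject: str, callback: Callable[[str], None]) -> None:
        self._subscribers[subject].append(callback)

    def publish(self, subject: str, payload: str) -> None:
        # In a real notification service this fan-out would be replicated and
        # ordered per subject to preserve delivery guarantees across failures.
        for callback in self._subscribers[subject]:
            callback(payload)


if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("/stocks/ACME", lambda msg: print("update:", msg))
    broker.publish("/stocks/ACME", "price=42.10")
```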

    A runtime heuristic to selectively replicate tasks for application-specific reliability targets

    In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated. This work was supported by the FI-DGR 2013 scholarship and the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
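
    As a rough illustration of selective replication toward a reliability target, the sketch below greedily replicates the most failure-prone tasks until an application-level target is met; the Task model, error-rate figures and the greedy rule are assumptions for illustration, not the published App_FIT heuristic.

```python
# Illustrative sketch of selective task replication toward a reliability target.
# The greedy selection rule below is an assumption, not the App_FIT heuristic.
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    runtime_s: float          # task execution time in seconds
    replicated: bool = False

    def failure_prob(self, error_rate_per_s: float) -> float:
        p = 1.0 - (1.0 - error_rate_per_s) ** self.runtime_s
        # A replicated task only fails if both copies fail.
        return p * p if self.replicated else p


def app_reliability(tasks: List[Task], error_rate: float) -> float:
    r = 1.0
    for t in tasks:
        r *= 1.0 - t.failure_prob(error_rate)
    return r


def select_replicas(tasks: List[Task], error_rate: float, target: float) -> None:
    """Replicate the most failure-prone tasks until the target is reached."""
    for t in sorted(tasks, key=lambda t: t.failure_prob(error_rate), reverse=True):
        if app_reliability(tasks, error_rate) >= target:
            break
        t.replicated = True


if __name__ == "__main__":
    tasks = [Task(f"t{i}", runtime_s=10.0 * (i + 1)) for i in range(8)]
    select_replicas(tasks, error_rate=1e-4, target=0.999)
    print([t.name for t in tasks if t.replicated])
```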

    Quality of Service of Crash-Recovery Failure Detectors

    This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective. We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state. We show that the fail-free run and the crash-stop run are special cases of the crash-recovery run, with mean time to failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow the measurement of the recovery speed, and the definition of the completeness property of a failure detector. The impact of the dependability of the crash-recovery target on the QoS bounds for such a crash-recovery failure detector is then analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate probabilistic model of the two-process failure detection system. According to our approximate model, we then show analytically how to estimate the failure detector's parameters to achieve a required QoS, based on Chen et al.'s NFD-S algorithm, and how to execute the configuration procedure of this crash-recovery failure detector. In order to make the failure detector adaptive to the target's crash-recovery behavior and enable the autonomy of the monitoring procedure, we propose two types of recovery detection protocol. One is a reliable recovery detection protocol, which guarantees to detect each occurring failure and recovery by adopting persistent storage. The other is a lightweight recovery detection protocol, which does not guarantee to detect every failure and recovery but which reduces the system overhead. Both of these recovery detection protocols improve the completeness without reducing the other QoS aspects of a failure detector. In addition, we demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself. In order to evaluate our analytical work, we simulate the following failure detection algorithms: the simple heartbeat timeout algorithm, the NFD-S algorithm, and the NFD-S algorithm with the lightweight recovery detection protocol, for various values of MTTF and MTTR. The simulation results show that the dependability of a recoverable monitored target can have a significant impact on the QoS of such a failure detector. This conforms well with our models and analysis. We show that, in the case of reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm for the completeness of a crash-recovery failure detector, and similarly for the other QoS metrics.
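
    The heartbeat-based monitoring setting analysed in the thesis can be sketched with a simplified timeout detector; the parameters and suspicion rule below are illustrative and are not the exact NFD-S algorithm of Chen et al. or its crash-recovery extensions.

```python
# Simplified heartbeat-timeout failure detector in the spirit of the algorithms
# discussed above; parameters and logic are illustrative, not the exact NFD-S
# algorithm of Chen et al.
from dataclasses import dataclass


@dataclass
class HeartbeatDetector:
    eta: float       # heartbeat sending interval at the monitored target
    delta: float     # extra delay tolerated for network transmission
    last_seq: int = -1
    last_arrival: float = float("-inf")

    def on_heartbeat(self, seq: int, now: float) -> None:
        if seq > self.last_seq:          # ignore late or duplicate heartbeats
            self.last_seq = seq
            self.last_arrival = now

    def suspects(self, now: float) -> bool:
        # Suspect a crash if no fresh heartbeat arrived within eta + delta.
        return now - self.last_arrival > self.eta + self.delta


if __name__ == "__main__":
    fd = HeartbeatDetector(eta=1.0, delta=0.5)
    fd.on_heartbeat(seq=1, now=0.0)
    print(fd.suspects(now=1.2))   # False: still within eta + delta
    print(fd.suspects(now=2.0))   # True: heartbeat overdue, target suspected
```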

    Fit-Broker: delivering a reliable service for event dissemination

    Master's thesis in Information Security, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2013. Cloud services are assuming a greater role in the world of service provision. These services range from simple working tools to complete remote computing infrastructures. As such, correct monitoring of this type of infrastructure is a key requirement to ensure availability and the fulfilment of service level agreements. Recent studies show that these infrastructures are not prepared to face some current and future security issues. Part of the problem lies in the fact that current monitoring tools are centralized and are only prepared to deal with some types of faults. In order to increase the resilience of monitoring systems, this dissertation proposes a framework capable of increasing the reliability of the transport of information between their many peers. It is an adaptable and resilient framework for event dissemination based on the publisher-subscriber paradigm. The framework offers multiple levels of resilience and quality of service that can be combined to meet the quality-of-service and protection needs of each system. This document describes the architecture, internal mechanisms and interfaces of the framework. We also describe a series of tests that were used to evaluate the performance of the framework in different scenarios.
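
    A minimal sketch of how per-topic delivery and protection levels might be combined in such a framework; the level names and the Subscription type are illustrative assumptions, not Fit-Broker's actual interface.

```python
# Illustrative combination of delivery-guarantee and protection levels in a
# pub/sub event-dissemination framework; names are assumptions, not Fit-Broker's API.
from dataclasses import dataclass
from enum import Enum


class Delivery(Enum):
    AT_MOST_ONCE = "at-most-once"     # no retransmission, lowest overhead
    AT_LEAST_ONCE = "at-least-once"   # acknowledge and retransmit until confirmed


class Protection(Enum):
    NONE = "none"
    AUTHENTICATED = "authenticated"   # sign events to detect tampering


@dataclass
class Subscription:
    topic: str
    delivery: Delivery
    protection: Protection


# A monitoring system can trade overhead for resilience per topic:
critical = Subscription("alerts/intrusion", Delivery.AT_LEAST_ONCE, Protection.AUTHENTICATED)
best_effort = Subscription("metrics/cpu", Delivery.AT_MOST_ONCE, Protection.NONE)
print(critical, best_effort, sep="\n")
```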

    Dependable IMS services - A Performance Analysis of Server Replication and Mid-Session Inter-Domain Handover


    Proactive software rejuvenation solution for web environments on virtualized platforms

    The availability of Information Technologies for everything, from everywhere, at all times is a growing requirement. We use Information Technologies for everything from common and social tasks to critical tasks such as managing nuclear power plants or even the International Space Station (ISS). However, the availability of IT infrastructures is still a huge challenge nowadays. A quick look at the news reveals reports of corporate outages affecting millions of users and impacting the revenue and image of the companies involved. It is well known that computer system outages are nowadays more often due to software faults than to hardware faults. Several studies have reported that one of the causes of unplanned software outages is the software aging phenomenon. This term refers to the accumulation of errors, usually causing resource contention, during long-running application executions such as web applications, which normally causes applications or systems to hang or crash. Gradual performance degradation may also accompany software aging. The phenomenon is often related to memory bloating/leaks, unterminated threads, data corruption, unreleased file locks or overruns, and several examples of software aging can be found in industry. The work presented in this thesis aims to offer a proactive and predictive software rejuvenation solution for Internet services against software aging caused by resource exhaustion. To this end, we first present a threshold-based proactive rejuvenation approach to avoid the consequences of software aging. This first approach has some limitations, the most important being the need to know a priori the resource or resources involved in the crash and their critical values; moreover, some expertise is needed to set the threshold value that triggers the rejuvenation action. Due to these limitations, we have evaluated the use of Machine Learning to overcome the weaknesses of our first approach and obtain a proactive and predictive solution. Finally, the current and increasing tendency to use virtualization technologies to improve resource utilization has turned traditional data centers into virtualized data centers or platforms. We have used a Mathematical Programming approach to virtual machine allocation and migration to optimize resources, accepting as many services as possible on the platform while, at the same time, guaranteeing the availability (via our software rejuvenation proposal) of the deployed services against the software aging phenomenon. The thesis is supported by an exhaustive experimental evaluation that proves the effectiveness and feasibility of our proposals for current systems.
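
    A minimal sketch of the threshold-based trigger described above, assuming a memory-usage metric and an illustrative 80% threshold; the metric, budget and rejuvenation action are assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of threshold-based proactive rejuvenation; the metric,
# threshold and action below are illustrative assumptions. POSIX-only.
import resource


def memory_usage_ratio(limit_mb: float) -> float:
    """Fraction of an assumed memory budget currently used by this process."""
    # On Linux, ru_maxrss is reported in KiB.
    used_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
    return used_mb / limit_mb


def maybe_rejuvenate(threshold: float = 0.8, limit_mb: float = 2048.0) -> bool:
    """Trigger rejuvenation (e.g. a graceful worker restart) before resources run out."""
    if memory_usage_ratio(limit_mb) >= threshold:
        print("threshold crossed: scheduling graceful restart of this worker")
        return True
    return False


if __name__ == "__main__":
    maybe_rejuvenate()
```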

    Autonomous Fault Detection in Self-Healing Systems using Restricted Boltzmann Machines

    Autonomously detecting and recovering from faults is one approach for reducing the operational complexity and costs associated with managing computing environments. We present a novel methodology for autonomously generating investigation leads that help identify system faults, and extend our previous work in this area by leveraging Restricted Boltzmann Machines (RBMs) and contrastive divergence learning to analyse changes in historical feature data. This allows us to heuristically identify the root cause of a fault, and to demonstrate an improvement to the state of the art by showing that feature data can be predicted heuristically beyond a single instance to include entire sequences of information. Published and presented at the 11th IEEE International Conference and Workshops on Engineering of Autonomic and Autonomous Systems (EASe 2014).
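
    A tiny RBM trained with one-step contrastive divergence (CD-1) illustrates the technique the paper leverages; the binary feature encoding, layer sizes and the use of reconstruction error as a fault signal are assumptions for illustration, not the authors' actual model.

```python
# Tiny Restricted Boltzmann Machine trained with one-step contrastive
# divergence (CD-1); sizes and the fault-scoring heuristic are illustrative.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible bias
        self.b = np.zeros(n_hidden)    # hidden bias
        self.lr, self.rng = lr, rng

    def cd1_step(self, v0):
        """One CD-1 update on a batch of binary feature vectors v0."""
        h0 = sigmoid(v0 @ self.W + self.b)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.a)   # reconstruction
        h1 = sigmoid(v1 @ self.W + self.b)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.a += self.lr * (v0 - v1).mean(axis=0)
        self.b += self.lr * (h0 - h1).mean(axis=0)

    def reconstruction_error(self, v):
        h = sigmoid(v @ self.W + self.b)
        return np.mean((v - sigmoid(h @ self.W.T + self.a)) ** 2, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    healthy = (rng.random((200, 16)) < 0.2).astype(float)   # "normal" feature snapshots
    rbm = RBM(n_visible=16, n_hidden=8)
    for _ in range(200):
        rbm.cd1_step(healthy)
    # High reconstruction error on new snapshots flags candidate fault states.
    print(rbm.reconstruction_error(healthy[:5]))
```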