6 research outputs found

    Lifeguard: Local Health Awareness for More Accurate Failure Detection

    SWIM is a peer-to-peer group membership protocol with attractive scaling and robustness properties. However, slow message processing can cause SWIM to mark healthy members as failed (so-called false positive failure detection), despite the inclusion of a mechanism to avoid this. We identify the properties of SWIM that lead to the problem and propose Lifeguard, a set of extensions to SWIM that allow for the possibility that the local failure detector module is itself at fault, via the concept of local health. We evaluate this approach in a precisely controlled environment and validate it in a real-world scenario, showing that it drastically reduces the rate of false positives. The false positive rate and the detection time for true failures can be reduced simultaneously, compared to the baseline levels of SWIM.

    Impact FD: An Unreliable Failure Detector Based on Process Relevance and Confidence in the System

    This paper presents a new unreliable failure detector, called the Impact failure detector (FD), that, unlike most traditional FDs, outputs a trust level value expressing the degree of confidence in the system. An impact factor is assigned to each process, and the trust level is equal to the sum of the impact factors of the processes not suspected of failure. Moreover, a threshold parameter defines a lower bound for the trust level, above which confidence in the system is ensured. In particular, we define a flexibility property that denotes the capacity of the Impact FD to tolerate a certain margin of failures or false suspicions, i.e., its capacity to consider different sets of responses that lead the system to trusted states. The Impact FD is suitable for systems that exhibit node redundancy, node heterogeneity, or clustering, and that allow a margin of failures which does not degrade confidence in the system. The paper also includes a timer-based distributed algorithm that implements an Impact FD, together with its proof of correctness, for systems whose links are lossy asynchronous and for those in which all (or some) links are eventually timely. Performance evaluation results, based on PlanetLab [1] traces, confirm the flexible applicability of our failure detector and show that, owing to the accepted margin of failure, both failures and false suspicions are better tolerated than with traditional unreliable failure detectors.
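
    The trust-level rule described in this abstract can be sketched as follows. This is only an illustration of the stated definition (sum of impact factors of unsuspected processes, compared against a threshold), not the paper's timer-based algorithm. The function names, the process identifiers, and the choice to treat a trust level equal to the threshold as trusted are assumptions made for the example.

```python
from typing import Dict, Set


def trust_level(impact: Dict[str, int], suspected: Set[str]) -> int:
    """Sum the impact factors of all processes that are not suspected."""
    return sum(factor for proc, factor in impact.items() if proc not in suspected)


def is_trusted(impact: Dict[str, int], suspected: Set[str], threshold: int) -> bool:
    """Assumption: the system is trusted when the trust level reaches the threshold."""
    return trust_level(impact, suspected) >= threshold


# Hypothetical system: one high-impact process and two low-impact ones.
impact = {"p1": 3, "p2": 1, "p3": 1}
threshold = 3

# Suspecting both low-impact processes still leaves a trusted system,
# while suspecting the single high-impact process does not.
low_impact_suspected = is_trusted(impact, {"p2", "p3"}, threshold)
high_impact_suspected = is_trusted(impact, {"p1"}, threshold)
```

    Different sets of suspicions can lead to the same trusted state, which is the kind of tolerance the flexibility property refers to.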

    A new adaptive accrual failure detector for dependable distributed systems

    A new adaptive accrual failure detector for dependable distributed systems / T. Ungerer ... - In: [electronic resource]: Seoul, Korea, March 11-15, 2007 / hosted by Seoul National University in Seoul ... - New York: ACM, 2007. - pp. 551-555. - 1 CD-ROM

    Support for dependable and adaptive distributed systems and applications

    Doctoral thesis, Informatics (Computer Engineering), Universidade de Lisboa, Faculdade de Ciências, 2011. Distributed applications executing in uncertain environments, like the Internet, need to make timing/synchrony assumptions (for instance, about the maximum message transmission delay) in order to make progress. In adaptive systems these temporal bounds should be computed at runtime, using probabilistic or specifically designed ad hoc approaches, typically with the objective of improving application performance. From a dependability perspective, however, the concern is to secure some properties on which the application can rely. This thesis addresses the problem of supporting adaptive systems and applications in stochastic environments from a dependability perspective: maintaining the correctness of system properties after adaptation. The idea behind dependable adaptation is to ensure that the assumed bounds for fundamental variables (e.g., network delays) are secured with a known and constant probability. Assuming that during its lifetime a system alternates between periods in which its temporal behavior is well characterized (stable phases) and transition periods in which network conditions vary (transient phases), the proposed approach rests on the following: if the environment is generically characterized in analytical terms and it is possible to detect the alternation of these stable and transient phases, then it is possible to adapt applications effectively and dependably. Based on this idea, the thesis introduces Adaptare, a framework for supporting dependable adaptation in stochastic environments. An extensive evaluation of Adaptare is provided, assessing the correctness and effectiveness of the implemented mechanisms. The results indicate that the proposed strategies and methodologies are effective in supporting dependable adaptation of distributed systems and applications. Finally, the applicability of Adaptare is evaluated in the context of two fundamental problems in distributed systems: consensus and failure detection. The thesis proposes solutions for these problems based on modular architectures in which Adaptare is used as a middleware for dependable adaptation of assumed timeouts. Fundação para a Ciência e a Tecnologia (FCT)
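
    The notion of a bound that holds with a known probability can be illustrated with a simple empirical quantile over observed delays. This is a generic sketch and not Adaptare's mechanism, which the abstract describes only in terms of analytically characterizing the environment and detecting stable and transient phases. The function name, the sample values, and the coverage parameter are assumptions made for the example.

```python
import math
from typing import Sequence


def timeout_for_coverage(delays_ms: Sequence[float], coverage: float) -> float:
    """Return the smallest observed delay that bounds at least `coverage`
    (a fraction in (0, 1]) of the samples."""
    if not delays_ms:
        raise ValueError("at least one delay sample is required")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(delays_ms)
    # Number of samples that must fall at or below the chosen bound.
    needed = math.ceil(coverage * len(ordered))
    return ordered[needed - 1]


# Hypothetical round-trip delays in milliseconds, with two outliers.
samples = [10, 12, 11, 50, 13, 12, 11, 14, 12, 90]
bound = timeout_for_coverage(samples, 0.9)

# Fraction of the observed samples that the chosen bound actually covers.
covered = sum(1 for d in samples if d <= bound) / len(samples)
```

    A dependable adaptation scheme would additionally need to recompute the bound when network conditions change, which is where phase detection comes in.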