Resource Allocation Optimization through Task Based Scheduling Algorithms in Distributed Real Time Embedded Systems

Author: Mishra Swagat
Publication date: 17/05/2011

A distributed embedded system is a type of distributed system consisting of a large number of nodes, each with lower computational power than a node of a regular distributed system (such as a cluster). A real-time system is one in which every task has an associated deadline and the system works with a continuous stream of data supplied in real time. Such systems find wide application in various fields, for example in the automobile industry as fly-by-wire, brake-by-wire and steer-by-wire systems. Scheduling and efficient allocation of resources are extremely important in such systems because a distributed embedded real-time system must deliver its output within a certain time frame, failing which the output becomes useless. In this paper, we take the number of processing units as the resource and optimize its allocation to the various tasks. We use techniques such as model-based redundancy, heartbeat monitoring and checkpointing for fault detection and failure recovery. Our fault tolerance framework uses an existing list-based scheduling algorithm for task scheduling. This helps in diagnosing and shutting down faulty actuators before the system becomes unsafe. The framework is designed and tested using a new simulation model consisting of virtual nodes working on a message passing system.
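The abstract does not say which list-based scheduling algorithm is used, so the following is only a minimal sketch of the general idea, under stated assumptions: tasks are independent, the priority list is ordered by earliest deadline, and each task goes to whichever processing unit becomes free first. The function name `list_schedule` and the task tuple layout are hypothetical and not taken from the thesis.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical task layout: (name, duration, deadline), all in the same time unit.
Task = Tuple[str, float, float]


def list_schedule(tasks: List[Task], n_units: int) -> Dict[str, Tuple[int, float, float, bool]]:
    """Greedy list scheduling of independent tasks on n_units processing units.

    Returns, per task name: (unit index, start time, finish time, deadline met?).
    """
    if n_units < 1:
        raise ValueError("need at least one processing unit")
    # Priority list: earliest deadline first (an assumed priority rule).
    ready = sorted(tasks, key=lambda t: t[2])
    # Min-heap of (time at which the unit becomes free, unit index).
    units = [(0.0, u) for u in range(n_units)]
    heapq.heapify(units)
    schedule = {}
    for name, duration, deadline in ready:
        free_at, unit = heapq.heappop(units)  # unit that is free earliest
        finish = free_at + duration
        schedule[name] = (unit, free_at, finish, finish <= deadline)
        heapq.heappush(units, (finish, unit))
    return schedule


def min_units_meeting_deadlines(tasks: List[Task], max_units: int) -> int:
    """Smallest unit count (up to max_units) for which this greedy rule misses
    no deadline, or -1 if none is found."""
    for n in range(1, max_units + 1):
        if all(entry[3] for entry in list_schedule(tasks, n).values()):
            return n
    return -1
```

Treating the number of processing units as the resource to optimize then amounts to searching for the smallest count that still meets every deadline, which is what `min_units_meeting_deadlines` does for this particular greedy rule; a different priority rule could need a different count.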

ethesis@nitr

QoS Self-configuring Failure Detectors for Distributed Systems

Author: F. Cristian
F.R.L. Lima
J.L. Hellerstein
K. Birman
K. Mills
K. Ogata
K.C.W. So
L. Falai
M. Bertier
M.J. Fischer
R.C. Nunes
T.D. Chandra
W. Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Crossref

Middleware-based Database Replication: The Gaps between Theory and Practice

Author: Ailamaki Anastasia
Candea George
Cecchet Emmanuel
Publication date: 01/01/2008

The need for high availability and performance in data management systems has been fueling a long-running interest in database replication from both academia and industry. However, academic groups often attack replication problems in isolation, overlooking the need for completeness in their solutions, while commercial teams take a holistic approach that often misses opportunities for fundamental innovation. Over time, this has created a gap between academic research and industrial practice. This paper aims to characterize the gap along three axes: performance, availability, and administration. We build on our own experience developing and deploying replication systems in commercial and academic settings, as well as on a large body of prior related work. We sift through representative examples from the last decade of open-source, academic, and commercial database replication systems and combine this material with case studies from real systems deployed at Fortune 500 customers. We propose two agendas, one for academic research and one for industrial R&D, which we believe can bridge the gap within 5-10 years. In this way, we hope to both motivate and help researchers in making the theory and practice of middleware-based database replication more relevant to each other. Comment: 14 pages. Appears in Proc. ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June 200

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

CiteSeerX

Quality of Service of Crash-Recovery Failure Detectors

Author: Ma Tiejun
Publication date: 01/01/2007

This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective. We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state. We show that the fail-free run and the crash-stop run are special cases of the crash-recovery run with mean time to failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow the measurement of the recovery speed, and the definition of the completeness property of a failure detector. Then, the impact of the dependability of the crash-recovery target on the QoS bounds for such a crash-recovery failure detector is analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate probabilistic model of the two-process failure detection system. Then, according to our approximate model, we show analytically how to estimate the failure detector’s parameters to achieve a required QoS, based on Chen et al.’s NFD-S algorithm, and how to execute the configuration procedure of this crash-recovery failure detector. In order to make the failure detector adaptive to the target’s crash-recovery behavior and enable the autonomy of the monitoring procedure, we propose two types of recovery detection protocols. One is a reliable recovery detection protocol, which is guaranteed to detect each failure and recovery by adopting persistent storage. The other is a lightweight recovery detection protocol, which is not guaranteed to detect every failure and recovery but which reduces the system overhead. Both of these recovery detection protocols improve the completeness without reducing the other QoS aspects of a failure detector. In addition, we also demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself. In order to evaluate our analytical work, we simulate the following failure detection algorithms: the simple heartbeat timeout algorithm, the NFD-S algorithm and the NFD-S algorithm with the lightweight recovery detection protocol, for various values of MTTF and MTTR. The simulation results show that the dependability of a recoverable monitored target can have a significant impact on the QoS of such a failure detector. This conforms well to our models and analysis. We show that in the case of a reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm for the completeness of a crash-recovery failure detector, and similarly for other QoS metrics
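As an illustration of the simplest algorithm named in the abstract, the heartbeat timeout detector, the sketch below replays a list of heartbeat arrival times and reports when the monitor would suspect the target. It is a minimal sketch under stated assumptions, not the thesis's simulator: the function name `suspicion_intervals`, the timer starting at time 0, and the fixed `timeout` are all hypothetical choices, and neither NFD-S nor the recovery detection protocols are modelled.

```python
from typing import List, Tuple


def suspicion_intervals(arrivals: List[float], timeout: float, end: float) -> List[Tuple[float, float]]:
    """Intervals in [0, end) during which the monitor suspects the target.

    The monitor restarts a timer of length `timeout` at every heartbeat
    arrival. When the timer expires it suspects the target, and it trusts
    the target again as soon as the next heartbeat arrives.
    """
    intervals = []
    deadline = timeout  # assumed: the first timer starts at time 0
    for t in sorted(arrivals):
        if t >= end:
            break
        if t > deadline:
            # The timer expired before this heartbeat: suspected in between.
            intervals.append((deadline, t))
        deadline = t + timeout
    if deadline < end:
        # No heartbeat arrived in time before the observation ended.
        intervals.append((deadline, end))
    return intervals
```

For a crash-recovery target, a gap in the heartbeat stream followed by new heartbeats shows up as a closed suspicion interval, while a crash with no recovery shows up as an interval that lasts until `end`. A shorter `timeout` detects crashes sooner but also turns ordinary message delay into false suspicions, which is the trade-off the QoS metrics quantify.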

CiteSeerX

Edinburgh Research Archive

Knowing your body and being compassionate with yourself

Author: Campion Maxine
Publication date: 01/06/2015

This portfolio thesis consists of four parts: a systematic literature review; a mixed methods empirical paper; a qualitative empirical paper; and supporting appendices. Part One is a systematic review of the literature regarding how body awareness can affect well‐being. This review stems from ideas of embodiment and reciprocal influences and connections between mind and body. The concept of embodiment also informed the development of an empirical investigation into the impact of compassionate imagery on affect and self‐compassion in a non‐clinical sample. This empirical study is divided into two papers, presented in Parts Two and Three. Part Two uses mixed methods to quantify the extent to which psychoeducation and meditative exercises can alter affect and self‐compassion. Additionally, participants’ interview responses to this experience are presented to add depth to the understanding of the clinical relevance of this practice. Part Three offers a qualitative study to explore reactions to the concept of self‐compassion in a non‐clinical sample. Participants emphasised the role of culture and systemic influences on their perceptions of their capacity to be self‐compassionate, and the paper presents a brief exploration of possible reasons for this and of methods by which these barriers could be overcome in order to promote well‐being. Part Four comprises appendices, including reflective and epistemological statements

Repository@Hull - Worktribe

FUSE: Lightweight Guaranteed Distributed Failure Notification

Author: Dunagan J.
Harvey N. J. A.
Jones M. B.
Kostic D.
Theimer M.
Wolman A.
Publication date: 24/01/2007

FUSE is a lightweight failure notification service for building distributed systems. Distributed systems built with FUSE are guaranteed that failure notifications never fail. Whenever a failure notification is triggered, all live members of the FUSE group will hear a notification within a bounded period of time, irrespective of node or communication failures. In contrast to previous work on failure detection, the responsibility for deciding that a failure has occurred is shared between the FUSE service and the distributed application. This allows applications to implement their own definitions of failure. Our experience building a scalable distributed event delivery system on an overlay network has convinced us of the usefulness of this service. Our results demonstrate that the network costs of each FUSE group can be small; in particular, our overlay network implementation requires no additional liveness-verifying ping traffic beyond that already needed to maintain the overlay, making the steady-state network load independent of the number of active FUSE groups

Infoscience - École polytechnique fédérale de Lausanne

ENHANCING CLOUD SYSTEM RUNTIME TO ADDRESS COMPLEX FAILURES

Author: Lou Chang
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 03/01/2024

As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them. This dissertation aims to enhance distributed systems’ capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems

Johns Hopkins University

Support for dependable and adaptive distributed systems and applications

Author: Dixit Mônica Lopes Muniz Corrêa
Publication date: 01/01/2011

Doctoral thesis, Informatics (Computer Engineering), Universidade de Lisboa, Faculdade de Ciências, 2011. Distributed applications executing in uncertain environments, like the Internet, need to make timing/synchrony assumptions (for instance, about the maximum message transmission delay) in order to make progress. In the case of adaptive systems, these temporal bounds should be computed at runtime, using probabilistic or specifically designed ad hoc approaches, typically with the objective of improving the application performance. From a dependability perspective, however, the concern is to secure some properties on which the application can rely. This thesis addresses the problem of supporting adaptive systems and applications in stochastic environments from a dependability perspective: maintaining the correctness of system properties after adaptation. The idea behind dependable adaptation consists in ensuring that the assumed bounds for fundamental variables (e.g., network delays) are secured with a known and constant probability. Assuming that during its lifetime a system alternates periods where its temporal behavior is well characterized (stable phases) with transition periods where a variation of the network conditions occurs (transient phases), the proposed approach is based on the following: if the environment is generically characterized in analytical terms and it is possible to detect the alternation of these stable and transient phases, then it is possible to effectively and dependably adapt applications. Based on this idea, the thesis introduces Adaptare, a framework for supporting dependable adaptation in stochastic environments. An extensive evaluation of Adaptare is provided, assessing the correctness and effectiveness of the implemented mechanisms. The results indicate that the proposed strategies and methodologies are indeed effective in supporting dependable adaptation of distributed systems and applications. Finally, the applicability of Adaptare is evaluated in the context of two fundamental problems in distributed systems: consensus and failure detection. The thesis proposes solutions for these problems based on modular architectures in which Adaptare is used as a middleware for dependable adaptation of assumed timeouts. Fundação para a Ciência e a Tecnologia (FCT)
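To make "a bound secured with a known probability" concrete, the sketch below picks a timeout from observed delays so that a chosen fraction of the sample falls at or below it. This is not Adaptare's mechanism, which the abstract does not detail; it is a plain empirical-quantile illustration, and the name `timeout_for_coverage` and the nearest-rank rule are assumptions made for this example.

```python
import math
from typing import Sequence


def timeout_for_coverage(delays: Sequence[float], coverage: float) -> float:
    """Smallest observed delay d such that at least `coverage` of the
    sample is <= d (nearest-rank empirical quantile).

    `coverage` is the target probability that a message arrives within
    the timeout, assuming future delays resemble the sample.
    """
    if not delays:
        raise ValueError("need at least one observed delay")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(delays)
    rank = math.ceil(coverage * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

The estimate is only meaningful while the delay distribution is stable. During a transient phase the sample no longer describes the network, which is why the approach summarized above depends on detecting the alternation between stable and transient phases before trusting a recomputed bound.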

Universidade de Lisboa: Repositório.UL

Model-driven Fault-Tolerance Provisioning for Component-based Distributed Real-time Embedded Systems

Author: Tambe Sumant
Publication venue: VANDERBILT

The CORBA object group service: a service approach to object groups in CORBA

Author: Felber Pascal
Publication venue: Lausanne, EPFL
Publication date: 16/03/2005

Distributed computing is one of the major trends in the computer industry. As systems become more distributed, they also become more complex and have to deal with new kinds of problems, such as partial crashes and link failures. To answer the growing demand for distributed technologies, several middleware environments have emerged during the last few years. These environments, however, lack support for "one-to-many" communication primitives; such primitives greatly simplify the development of several types of applications that have requirements for high availability, fault tolerance, parallel processing, or collaborative work. One-to-many interactions can be provided by group communication, which manages groups of objects and provides primitives for sending messages to all members of a group, with various reliability and ordering guarantees. A group constitutes a logical addressing facility: messages can be issued to a group without having to know the number, identity, or location of individual members. The notion of a group has proven to be very useful for providing high availability through replication: a set of replicas constitutes a group but is viewed by clients as a single entity in the system. This thesis aims at studying and proposing solutions to the problem of object group support in object-based middleware environments. It surveys and evaluates different approaches to this problem. Based on this evaluation, we propose a system model and an open architecture to add support for object groups to the CORBA middleware environment. In doing so, we provide the application developer with powerful group primitives in the context of a standard object-based environment. This thesis contributes to ongoing standardization efforts that aim to support fault tolerance in CORBA, using entity redundancy. The group architecture proposed in this thesis, the Object Group Service (OGS), is based on the concept of component integration. 
It consists of several distinct components that provide various facilities for reliable distributed computing and that are reusable in isolation. Group support is ultimately provided by combining these components. OGS defines an object-oriented framework of CORBA components for reliable distributed systems. The OGS components include a group membership service, which keeps track of the composition of object groups, a group multicast service, which provides delivery of messages to all group members, a consensus service, which allows several CORBA objects to resolve distributed agreement problems, and a monitoring service, which provides distributed failure detection mechanisms. OGS includes support for dynamic group membership and for group multicast with various reliability and ordering guarantees. It defines interfaces for active and primary-backup replication. In addition, OGS proposes several execution styles and various levels of transparency. A prototype implementation of OGS has been realized in the context of this thesis. This implementation is available for two commercial ORBs (Orbix and VisiBroker). It relies solely on the CORBA specification, and is thus portable to any compliant ORB. Although the main theme of this thesis deals with system architecture, we have developed some original algorithms to implement group support in OGS. We analyze these algorithms and implementation choices in this dissertation, and we evaluate them in terms of efficiency. We also illustrate the use of OGS through example applications

Infoscience - École polytechnique fédérale de Lausanne