Kaiser Foundation Hospital & Kaiser Foundation Health Plan Inc. and UNITE HERE, AFL-CIO, Local 5 (2004)

Publication venue: DigitalCommons@ILR
Publication date: 01/07/2004

DigitalCommons@ILR

A new approach to detecting failures in distributed systems

Author: Leners Joshua Blaise
Publication date: 18/09/2015

Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that process has failed. Second, choosing a timeout is hard: short timeouts can exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value, one that balances the choice between accuracy and speed, may not even exist, owing to the variance in a system’s end-to-end delays. This dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS’s process table. We call such information inside information, and use it as the basis in the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed by sometimes terminating unresponsive components. This choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, and reports inside information to applications directly and with more detail to help applications make better recovery decisions.
By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude over the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.

Field of study: Computer Science
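
The timeout trade-off described in this abstract can be illustrated with a small sketch. It is not taken from the dissertation: the function name, the report fields, and the sample delays are all hypothetical, and the model simply assumes that a crash is reported as soon as the timeout expires.

```python
from dataclasses import dataclass


@dataclass
class TimeoutReport:
    timeout: float            # seconds to wait before suspecting a crash
    false_suspicions: int     # live-but-slow responses wrongly flagged
    detection_delay: float    # wait before a real crash is reported


def evaluate_timeout(delays, timeout):
    """Score one timeout against response delays observed from a LIVE process.

    Any delay above the timeout would have been reported as a crash even
    though the process was only slow, so it counts as a false suspicion.
    A real crash is reported only once the timeout expires, so in this
    simplified model the detection delay equals the timeout itself.
    """
    false_suspicions = sum(1 for d in delays if d > timeout)
    return TimeoutReport(timeout, false_suspicions, timeout)


# Hypothetical end-to-end delays in seconds, with two slow outliers.
observed_delays = [0.02, 0.03, 0.05, 0.04, 1.8, 0.03, 6.5, 0.02]

short = evaluate_timeout(observed_delays, timeout=0.1)   # fast but inaccurate
long = evaluate_timeout(observed_delays, timeout=10.0)   # accurate but slow

print(short)
print(long)
```

With these made-up delays, the short timeout flags both slow responses as crashes, while the long timeout avoids false suspicions only by waiting far longer than a typical response. When delays vary this widely, no single value does well on both counts, which is the point the abstract makes.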

Texas ScholarWorks

DeMMon Decentralized Management and Monitoring Framework

Author: Morais Nuno
Publication date: 01/11/2021

The centralized model proposed by the Cloud computing paradigm mismatches the decentralized nature of mobile and IoT applications, given that most data production and consumption is performed by end-user devices outside of the Data Center (DC). As the number of these devices grows, and given the need to transport data to and from DCs for computation, application providers incur additional infrastructure costs, and end-users incur delays when performing operations. These reasons have led us into a post-cloud era, where a new computing paradigm arose: Edge Computing. Edge Computing takes into account the broad spectrum of devices residing outside of the DC, closer to the clients, as potential targets for computations, potentially reducing infrastructure costs, improving the quality of service (QoS) for end-users and allowing new interaction paradigms between users and applications. Managing and monitoring the execution of these devices raises new challenges previously unaddressed by Cloud computing, given the scale of these systems and the devices’ (potentially) unreliable data connections and heterogeneous computational power. The study of the state of the art has revealed that existing resource monitoring and management solutions require manual configuration and have centralized components, which we believe do not scale to larger systems. In this work, we address these limitations by presenting a novel Decentralized Management and Monitoring (“DeMMon”) system, targeted at edge settings. DeMMon provides primitives to ease the development of tools that manage computational resources that support edge-enabled applications, decomposed in components, through decentralized actions, taking advantage of partial knowledge of the system. We evaluated our solution to assess its benefits regarding information dissemination and monitoring capabilities across a set of realistic emulated scenarios of up to 750 nodes with variable failure rates.
The results show the validity of our approach and that it can outperform state-of-the-art solutions regarding scalability and reliability.

The centralized computing model used in the Cloud Computing paradigm has limitations in the context of Internet of Things and mobile applications. In this type of application, data is produced and consumed mostly by devices located at the edge of the network. Transporting this data to and from data centers therefore imposes an excessive load on the network infrastructure that connects the devices to the data centers, increasing response latency and reducing quality of service for users. To counter these limitations, the Edge Computing paradigm emerged. It proposes running computations, and potentially storing data, on devices outside the data centers, closer to the clients, reducing costs and creating a new range of possibilities for performing distributed computations closer to the devices that produce and consume the data. However, managing and monitoring the execution of these devices raises obstacles not considered by Cloud Computing, such as the scale of these systems, or the variability in connectivity and computing capacity of the devices that compose them. The study of the literature reveals that popular tools for managing and monitoring applications and devices have scalability limitations, such as centralized points of failure, or require manual configuration of each device. This dissertation proposes a new decentralized solution for monitoring and information dissemination. The solution offers operations for collecting information about the state of the system, to be used by (also decentralized) solutions that manage applications specialized to run at the edge of the network.
Our solution was evaluated on emulated networks of various sizes with up to 750 nodes, in the context of information dissemination and monitoring. Our results show that our system is more robust while also being more scalable when compared with the state of the art.

Repositório da Universidade Nova de Lisboa

Fry's Marketplace and United Food & Commercial Workers International Union (UFCW), AFL-CIO-CLC, Local 99 (2003)

Publication venue: DigitalCommons@ILR
Publication date: 26/10/2003

DigitalCommons@ILR

Innovations in Radiotherapy Technology.

Author: Beddar S, Court L, Feain IJ, Keall P, Palta JR
Publication venue: Elsevier BV
Publication date: 01/02/2017

Many low- and middle-income countries, together with remote and low socioeconomic populations within high-income countries, lack the resources and services to deal with cancer. The challenges in upgrading or introducing the necessary services are enormous, from screening and diagnosis to radiotherapy planning/treatment and quality assurance. There are severe shortages not only in equipment, but also in the capacity to train, recruit and retain staff as well as in their ongoing professional development via effective international peer-review and collaboration. Here we describe some examples of emerging technology innovations based on real-time software and cloud-based capabilities that have the potential to redress some of these shortages. These include: (i) automatic treatment planning to reduce physics staffing shortages, (ii) real-time image-guided adaptive radiotherapy technologies, (iii) fixed-beam radiotherapy treatment units that use patient (rather than gantry) rotation to reduce infrastructure costs and staff-to-patient ratios, (iv) cloud-based infrastructure programmes to facilitate international collaboration and quality assurance and (v) high dose rate mobile cobalt brachytherapy techniques for intraoperative radiotherapy.

Sydney eScholarship

Quality of Service of Crash-Recovery Failure Detectors

Author: Ma Tiejun
Publication date: 01/01/2007

This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective. We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state. We show that the fail-free run and the crash-stop run are special cases of the crash-recovery run with mean time to failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow the measurement of the recovery speed, and the definition of the completeness property of a failure detector. Then, the impact of the dependability of the crash-recovery target on the QoS bounds for such a crash-recovery failure detector is analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate probabilistic model of the two-process failure detection system. Then, according to our approximate model, we show analytically how to estimate the failure detector’s parameters to achieve a required QoS, based on Chen et al.’s NFD-S algorithm, and how to execute the configuration procedure of this crash-recovery failure detector. In order to make the failure detector adaptive to the target’s crash-recovery behavior and enable the autonomy of the monitoring procedure, we propose two types of recovery detection protocols. One is a reliable recovery detection protocol, which can guarantee to detect each occurring failure and recovery by adopting persistent storage.
The other is a lightweight recovery detection protocol, which does not guarantee to detect every failure and recovery but which reduces the system overhead. Both of these recovery detection protocols improve the completeness without reducing the other QoS aspects of a failure detector. In addition, we also demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself. In order to evaluate our analytical work, we simulate the following failure detection algorithms: the simple heartbeat timeout algorithm, the NFD-S algorithm and the NFD-S algorithm with the lightweight recovery detection protocol, for various values of MTTF and MTTR. The simulation results show that the dependability of a recoverable monitored target could have significant impact on the QoS of such a failure detector. This conforms well to our models and analysis. We show that in the case of reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm for the completeness of a crash-recovery failure detector, and similarly for other QoS metrics.
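
The two limiting cases mentioned in this abstract can be made concrete with the standard steady-state availability of a target that alternates between up and down periods. This is a textbook relation offered here as an illustration under that assumption; it is not necessarily the probabilistic model developed in the thesis, and the symbol $A$ is introduced only for this sketch.

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \qquad
\lim_{\mathrm{MTTF} \to \infty} A = 1 \quad (\text{fail-free}), \qquad
\lim_{\mathrm{MTTR} \to \infty} A = 0 \quad (\text{crash-stop}).
```

In the first limit the target effectively never fails. In the second it never returns after its first crash, which matches the classical crash-stop assumption.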

CiteSeerX

Edinburgh Research Archive