A new approach to detecting failures in distributed systems
Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that process has failed. Second, choosing a timeout is hard: short timeouts can exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value (one that balances the choice between accuracy and speed) may not even exist, owing to the variance in a system's end-to-end delays. This dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS's process table. We call such information inside information, and use it as the basis in the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed by sometimes terminating unresponsive components. This choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, and reports inside information to applications directly and with more detail to help applications make better recovery decisions.
By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude over the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.
Computer Science
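The timeout trade-off described in the abstract above can be illustrated with a minimal sketch. It is not taken from the dissertation: the function name `false_suspicion_rate` and the delay values are hypothetical, chosen only to show how variance in end-to-end delays forces a choice between accuracy and speed.

```python
def false_suspicion_rate(delays_ms, timeout_ms):
    """Fraction of responses from a live process that arrive after the timeout.

    Each late response would make a timeout-based detector wrongly
    suspect a process that has not crashed.
    """
    late = sum(1 for d in delays_ms if d > timeout_ms)
    return late / len(delays_ms)


# Hypothetical end-to-end delays (ms) of a process that never crashed.
# Two slow responses (250 ms and 900 ms) stand in for delay variance.
delays_ms = [12, 15, 11, 14, 250, 13, 16, 900, 12, 14]

for timeout_ms in (20, 300, 1000):
    rate = false_suspicion_rate(delays_ms, timeout_ms)
    # A real crash is only noticed after the timeout expires, so the
    # timeout is also a lower bound on the wait before detection.
    print(f"timeout={timeout_ms} ms  false suspicions={rate:.0%}")
```

With these assumed delays, the short timeout detects crashes quickly but wrongly suspects the live process on a fraction of its responses, while the long timeout avoids false suspicions at the cost of waiting a full second before any crash is noticed.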
DeMMon Decentralized Management and Monitoring Framework
The centralized model proposed by the Cloud computing paradigm mismatches the decentralized
nature of mobile and IoT applications, given the fact that most of the data
production and consumption is performed by end-user devices outside of the Data Center
(DC). As the number of these devices grows, and given the need to transport data to and
from DCs for computation, application providers incur additional infrastructure costs,
and end-users incur delays when performing operations.
These reasons have led us into a post-cloud era, where a new computing paradigm
arose: Edge Computing. Edge Computing takes into account the broad spectrum of
devices residing outside of the DC, closer to the clients, as potential targets for computations,
potentially reducing infrastructure costs, improving the quality of service (QoS)
for end-users and allowing new interaction paradigms between users and applications.
Managing and monitoring the execution of these devices raises new challenges previously
unaddressed by Cloud computing, given the scale of these systems and the devices'
(potentially) unreliable data connections and heterogeneous computational power. Our
study of the state of the art revealed that existing resource monitoring and management
solutions require manual configuration and have centralized components, which
we believe do not scale to larger systems.
In this work, we address these limitations by presenting a novel Decentralized Management
and Monitoring ("DeMMon") system, targeted for edge settings. DeMMon provides
primitives to ease the development of tools that manage computational resources
that support edge-enabled applications, decomposed in components, through decentralized
actions, taking advantage of partial knowledge of the system. We evaluated our
solution to assess its benefits regarding information dissemination and monitoring
capabilities across a set of realistic emulated scenarios of up to 750 nodes with variable
failure rates. The results show the validity of our approach and that it can outperform
state-of-the-art solutions regarding scalability and reliability.
The centralized computing model used in the Cloud Computing paradigm
has limitations in the context of Internet of Things and mobile
applications. In this type of application, data is produced and consumed
mostly by devices located at the edge of the network. Consequently,
transporting this data to and from data centers places an excessive load on
the network infrastructure that connects the devices to the data centers, increasing
response latency and lowering the quality of service for users.
To address these limitations, the Edge Computing paradigm emerged. This
paradigm proposes executing computations, and potentially storing
data, on devices outside the data centers, closer to the clients, reducing
costs and creating a new range of possibilities for performing distributed computations
closer to the devices that produce and consume the data.
However, managing and supervising the execution of these devices raises obstacles not
considered by Cloud Computing, such as the scale of these systems and the variability
in the connectivity and computing capacity of the devices that compose them.
A study of the literature reveals that popular tools for managing and supervising applications
and devices have limitations that hinder their scalability, for example
centralized points of failure or the need to manually configure each device.
This dissertation proposes a new decentralized solution for monitoring and
information dissemination. The solution offers operations for collecting
information about the state of the system, so that it can be used by (also
decentralized) solutions that manage applications specialized to run at the edge of the network.
Our solution was evaluated on emulated networks of various sizes, with a maximum
of 750 nodes, in the context of information dissemination and monitoring. Our
results show that our system is more robust while also being
more scalable than the state of the art.
Innovations in Radiotherapy Technology.
Many low- and middle-income countries, together with remote and low socioeconomic populations within high-income countries, lack the resources and services to deal with cancer. The challenges in upgrading or introducing the necessary services are enormous, from screening and diagnosis to radiotherapy planning/treatment and quality assurance. There are severe shortages not only in equipment, but also in the capacity to train, recruit and retain staff as well as in their ongoing professional development via effective international peer-review and collaboration. Here we describe some examples of emerging technology innovations based on real-time software and cloud-based capabilities that have the potential to redress some of these areas. These include: (i) automatic treatment planning to reduce physics staffing shortages, (ii) real-time image-guided adaptive radiotherapy technologies, (iii) fixed-beam radiotherapy treatment units that use patient (rather than gantry) rotation to reduce infrastructure costs and staff-to-patient ratios, (iv) cloud-based infrastructure programmes to facilitate international collaboration and quality assurance and (v) high dose rate mobile cobalt brachytherapy techniques for intraoperative radiotherapy
Quality of Service of Crash-Recovery Failure Detectors
This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective.
We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one
which has the ability to recover from the crash state. We show that the fail-free run
and the crash-stop run are special cases of the crash-recovery run with mean time to
failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow
the measurement of the recovery speed, and the definition of the completeness property
of a failure detector. Then, the impact of the dependability of the crash-recovery
target on the QoS bounds for such a crash-recovery failure detector is analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate
probabilistic model of the two-process failure detection system. Then, according to
our approximate model, we show analytically how to estimate the failure detector's parameters to
achieve a required QoS, based on Chen et al.'s NFD-S algorithm, and how
to execute the configuration procedure of this crash-recovery failure detector.
In order to make the failure detector adaptive to the target's crash-recovery behavior and enable the autonomy of the monitoring procedure, we propose two types of recovery detection protocols. One is a reliable recovery detection protocol, which guarantees detection of each failure and recovery by using persistent storage. The other is a lightweight recovery detection protocol, which does not guarantee detection of every failure and recovery but reduces the system overhead. Both of these recovery detection protocols improve the completeness without reducing the other QoS aspects of a failure detector. In addition, we also demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself.
In order to evaluate our analytical work, we simulate the following failure detection algorithms: the simple heartbeat timeout algorithm, the NFD-S algorithm, and the NFD-S
algorithm with the lightweight recovery detection protocol, for various values of
MTTF and MTTR. The simulation results show that the dependability of a recoverable
monitored target can have a significant impact on the QoS of such a failure detector.
This conforms well to our models and analysis. We show that, for a reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm for the completeness of a crash-recovery failure detector, and similarly for other QoS metrics.
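The simple heartbeat timeout algorithm mentioned in the abstract above can be sketched as follows. This is an illustrative sketch only, not the thesis's simulator and not the NFD-S algorithm: the function name `suspicion_intervals`, the heartbeat arrival times, and the timeout value are all assumptions made for the example. It shows why completeness matters for a crash-recovery target: a crash that is shorter than the timeout is never noticed.

```python
def suspicion_intervals(arrivals, timeout, end_time):
    """Return (start, end) intervals in which the monitor suspects the target.

    Assumed monitor behavior: it starts suspecting the target once
    `timeout` time units have passed since the last heartbeat, and it
    trusts the target again as soon as the next heartbeat arrives.
    `arrivals` must be a sorted list of heartbeat arrival times.
    """
    intervals = []
    # Pair each heartbeat with the next one (or with the end of the run).
    for prev, nxt in zip(arrivals, arrivals[1:] + [end_time]):
        deadline = prev + timeout
        if deadline < nxt:
            intervals.append((deadline, nxt))
    return intervals


# Hypothetical run 1: heartbeats every 1 time unit; the target crashes at
# t=3.5 and recovers at t=7.2, so the heartbeats at t=4..7 are missing.
long_crash = suspicion_intervals([0, 1, 2, 3, 8, 9, 10], timeout=2, end_time=10)
print(long_crash)   # the crash is suspected from t=5 until the heartbeat at t=8

# Hypothetical run 2: the target crashes at t=3.2 and recovers at t=3.9,
# between two heartbeats, so no heartbeat is missed.
short_crash = suspicion_intervals([0, 1, 2, 3, 4, 5], timeout=2, end_time=5)
print(short_crash)  # empty: the crash and the recovery both go undetected
```

In the first run the detection time is the gap between the crash (t=3.5) and the start of suspicion (t=5). The second run is the case that a recovery detection protocol is meant to address, since the heartbeat stream alone gives no sign that the target ever failed.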