Search CORE

83 research outputs found

Byzantine fault-tolerant agreement protocols for wireless Ad hoc networks

Author: Moniz Henrique, 1980-
Publication venue
Publication date: 01/01/2010
Field of study

Tese de doutoramento, Informática (Ciências da Computação), Universidade de Lisboa, Faculdade de Ciências, 2010.The thesis investigates the problem of fault- and intrusion-tolerant consensus in resource-constrained wireless ad hoc networks. This is a fundamental problem in distributed computing because it abstracts the need to coordinate activities among various nodes. It has been shown to be a building block for several other important distributed computing problems like state-machine replication and atomic broadcast. The thesis begins by making a thorough performance assessment of existing intrusion-tolerant consensus protocols, which shows that the performance bottlenecks of current solutions are in part related to their system modeling assumptions. Based on these results, the communication failure model is identified as a model that simultaneously captures the reality of wireless ad hoc networks and allows the design of efficient protocols. Unfortunately, the model is subject to an impossibility result stating that there is no deterministic algorithm that allows n nodes to reach agreement if more than n2 omission transmission failures can occur in a communication step. This result is valid even under strict timing assumptions (i.e., a synchronous system). The thesis applies randomization techniques in increasingly weaker variants of this model, until an efficient intrusion-tolerant consensus protocol is achieved. The first variant simplifies the problem by restricting the number of nodes that may be at the source of a transmission failure at each communication step. An algorithm is designed that tolerates f dynamic nodes at the source of faulty transmissions in a system with a total of n 3f + 1 nodes. The second variant imposes no restrictions on the pattern of transmission failures. The proposed algorithm effectively circumvents the Santoro- Widmayer impossibility result for the first time. It allows k out of n nodes to decide despite dn 2 e(nk)+k2 omission failures per communication step. This algorithm also has the interesting property of guaranteeing safety during arbitrary periods of unrestricted message loss. The final variant shares the same properties of the previous one, but relaxes the model in the sense that the system is asynchronous and that a static subset of nodes may be malicious. The obtained algorithm, called Turquois, admits f < n 3 malicious nodes, and ensures progress in communication steps where dnf 2 e(n k f) + k 2. The algorithm is subject to a comparative performance evaluation against other intrusiontolerant protocols. The results show that, as the system scales, Turquois outperforms the other protocols by more than an order of magnitude.Esta tese investiga o problema do consenso tolerante a faltas acidentais e maliciosas em redes ad hoc sem fios. Trata-se de um problema fundamental que captura a essência da coordenação em actividades envolvendo vários nós de um sistema, sendo um bloco construtor de outros importantes problemas dos sistemas distribuídos como a replicação de máquina de estados ou a difusão atómica. A tese começa por efectuar uma avaliação de desempenho a protocolos tolerantes a intrusões já existentes na literatura. Os resultados mostram que as limitações de desempenho das soluções existentes estão em parte relacionadas com o seu modelo de sistema. Baseado nestes resultados, é identificado o modelo de falhas de comunicação como um modelo que simultaneamente permite capturar o ambiente das redes ad hoc sem fios e projectar protocolos eficientes. Todavia, o modelo é restrito por um resultado de impossibilidade que afirma não existir algoritmo algum que permita a n nós chegaram a acordo num sistema que admita mais do que n2 transmissões omissas num dado passo de comunicação. Este resultado é válido mesmo sob fortes hipóteses temporais (i.e., em sistemas síncronos) A tese aplica técnicas de aleatoriedade em variantes progressivamente mais fracas do modelo até ser alcançado um protocolo eficiente e tolerante a intrusões. A primeira variante do modelo, de forma a simplificar o problema, restringe o número de nós que estão na origem de transmissões faltosas. É apresentado um algoritmo que tolera f nós dinâmicos na origem de transmissões faltosas em sistemas com um total de n 3f + 1 nós. A segunda variante do modelo não impõe quaisquer restrições no padrão de transmissões faltosas. É apresentado um algoritmo que contorna efectivamente o resultado de impossibilidade Santoro-Widmayer pela primeira vez e que permite a k de n nós efectuarem progresso nos passos de comunicação em que o número de transmissões omissas seja dn 2 e(n k) + k 2. O algoritmo possui ainda a interessante propriedade de tolerar períodos arbitrários em que o número de transmissões omissas seja superior a . A última variante do modelo partilha das mesmas características da variante anterior, mas com pressupostos mais fracos sobre o sistema. Em particular, assume-se que o sistema é assíncrono e que um subconjunto estático dos nós pode ser malicioso. O algoritmo apresentado, denominado Turquois, admite f < n 3 nós maliciosos e assegura progresso nos passos de comunicação em que dnf 2 e(n k f) + k 2. O algoritmo é sujeito a uma análise de desempenho comparativa com outros protocolos na literatura. Os resultados demonstram que, à medida que o número de nós no sistema aumenta, o desempenho do protocolo Turquois ultrapassa os restantes em mais do que uma ordem de magnitude.FC

Universidade de Lisboa: Repositório.UL

Intrusion Tolerant Routing Protocols for Wireless Sensor Networks

Author: Guerreiro André Ivo Azevedo
Publication venue
Publication date: 01/01/2011
Field of study

This MSc thesis is focused in the study, solution proposal and experimental evaluation of security solutions for Wireless Sensor Networks (WSNs). The objectives are centered on intrusion tolerant routing services, adapted for the characteristics and requirements of WSN nodes and operation behavior. The main contribution addresses the establishment of pro-active intrusion tolerance properties at the network level, as security mechanisms for the proposal of a reliable and secure routing protocol. Those properties and mechanisms will augment a secure communication base layer supported by light-weigh cryptography methods, to improve the global network resilience capabilities against possible intrusion-attacks on the WSN nodes. Adapting to WSN characteristics, the design of the intended security services also pushes complexity away from resource-poor sensor nodes towards resource-rich and trustable base stations. The devised solution will construct, securely and efficiently, a secure tree-structured routing service for data-dissemination in large scale deployed WSNs. The purpose is to tolerate the damage caused by adversaries modeled according with the Dolev-Yao threat model and ISO X.800 attack typology and framework, or intruders that can compromise maliciously the deployed sensor nodes, injecting, modifying, or blocking packets, jeopardizing the correct behavior of internal network routing processing and topology management. The proposed enhanced mechanisms, as well as the design and implementation of a new intrusiontolerant routing protocol for a large scale WSN are evaluated by simulation. For this purpose, the evaluation is based on a rich simulation environment, modeling networks from hundreds to tens of thousands of wireless sensors, analyzing different dimensions: connectivity conditions, degree-distribution patterns, latency and average short-paths, clustering, reliability metrics and energy cost

Repositório da Universidade Nova de Lisboa

Services for safety-critical applications on dual-scheduled TDMA networks

Author: Rosset Valerio
Publication venue
Publication date: 01/01/2009
Field of study

Tese de doutoramento. Engenharia Electrotécnica e de Computadores. Faculdade de Engenharia. Universidade do Porto. 200

Repositório Aberto da Universidade do Porto

SoK: Understanding BFT Consensus in the Age of Blockchains

Author: Gang Wang
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 05/07/2021
Field of study

Blockchain as an enabler to current Internet infrastructure has provided many unique features and revolutionized current distributed systems into a new era. Its decentralization, immutability, and transparency have attracted many applications to adopt the design philosophy of blockchain and customize various replicated solutions. Under the hood of blockchain, consensus protocols play the most important role to achieve distributed replication systems. The distributed system community has extensively studied the technical components of consensus to reach agreement among a group of nodes. Due to trust issues, it is hard to design a resilient system in practical situations because of the existence of various faults. Byzantine fault-tolerant (BFT) state machine replication (SMR) is regarded as an ideal candidate that can tolerate arbitrary faulty behaviors. However, the inherent complexity of BFT consensus protocols and their rapid evolution makes it hard to practically adapt themselves into application domains. There are many excellent Byzantine-based replicated solutions and ideas that have been contributed to improving performance, availability, or resource efficiency. This paper conducts a systematic and comprehensive study on BFT consensus protocols with a specific focus on the blockchain era. We explore both general principles and practical schemes to achieve consensus under Byzantine settings. We then survey, compare, and categorize the state-of-the-art solutions to understand BFT consensus in detail. For each representative protocol, we conduct an in-depth discussion of its most important architectural building blocks as well as the key techniques they used. We aim that this paper can provide system researchers and developers a concrete view of the current design landscape and help them find solutions to concrete problems. Finally, we present several critical challenges and some potential research directions to advance the research on exploring BFT consensus protocols in the age of blockchains

Cryptology ePrint Archive

Constructing fail-controlled nodes for distributed systems: a software approach

Author: Brasileiro Francisco Vilar
Publication venue: Newcastle University
Publication date: 01/01/1995
Field of study

PhD ThesisDesigning and implementing distributed systems which continue to provide specified services in the presence of processing site and communication failures is a difficult task. To facilitate their development, distributed systems have been built assuming that their underlying hardware components are Jail-controlled, i.e. present a well defined failure mode. However, if conventional hardware cannot provide the assumed failure mode, there is a need to build processing sites or nodes, and communication infra-structure that present the fail-controlled behaviour assumed. Coupling a number of redundant processors within a replicated node is a well known way of constructing fail-controlled nodes. Computation is replicated and executed simultaneously at each processor, and by employing suitable validation techniques to the outputs generated by processors (e.g. majority voting, comparison), outputs from faulty processors can be prevented from appearing at the application level. One way of constructing replicated nodes is by introducing hardwired mechanisms to couple replicated processors with specialised validation hardware circuits. Processors are tightly synchronised at the clock cycle level, and have their outputs validated by a reliable validation hardware. Another approach is to use software mechanisms to perform synchronisation of processors and validation of the outputs. The main advantage of hardware based nodes is the minimum performance overhead incurred. However, the introduction of special circuits may increase the complexity of the design tremendously. Further, every new microprocessor architecture requires considerable redesign overhead. Software based nodes do not present these problems, on the other hand, they introduce much bigger performance overheads to the system. In this thesis we investigate alternative ways of constructing efficient fail-controlled, software based replicated nodes. In particular, we present much more efficient order protocols, which are necessary for the implementation of these nodes. Our protocols, unlike others published to date, do not require processors' physical clocks to be explicitly synchronised. The main contribution of this thesis is the precise definition of the semantics of a software based Jail-silent node, along with its efficient design, implementation and performance evaluation.The Brazilian National Research Council (CNPq/Brasil)

Newcastle University eTheses

Reliability Mechanisms for Controllers in Real-Time Cyber-Physical Systems

Author: Mashood Mohiuddin Maaz
Publication venue: Lausanne, EPFL
Publication date: 22/10/2018
Field of study

Cyber-physical systems (CPSs) are real-world processes that are controlled by computer algorithms. We consider CPSs where a centralized, software-based controller maintains the process in a desired state by exchanging measurements and setpoints with process agents (PAs). As CPSs control processes with low-inertia, e.g., electric grids and autonomous cars, the controller needs to satisfy stringent real-time constraints. However, the controllers are susceptible to delay and crash faults, and the communication network might drop, delay or reorder messages. This degrades the quality of control of the physical process, failure of which can result in damage to life or property. Existing reliability solutions are either not well-suited for real-time CPSs or impose serious restrictions on the controllers. In this thesis, we design, implement and evaluate reliability mechanisms for real-time CPS controllers that require minimal modifications to the controller itself. We begin by abstracting the execution of a CPS using events in the CPS, and the two inherent relations among those events, namely network and computation relations. We use these relations to introduce the intentionality relation that uses these events to capture the state of the physical process. Based on the intentionality relation, we define three correctness properties namely, state safety, optimal selection and consistency, that together provide linearizability (one-copy equivalence) for CPS controllers. We propose intentionality clocks and Quarts, and prove that they provide linearizability. To provide consistency, Quarts ensures agreement among controller replicas, which is typically achieved using consensus. Consensus can add an unbounded-latency overhead. Quarts leverages the properties specific to CPSs to perform agreement using pre-computed priorities among sets of received measurements, resulting in a bounded-latency overhead with high availability. Using simulation, we show that availability of Quarts, with two replicas, is more than an order of magnitude higher than consensus. We also propose Axo, a fault-tolerance protocol that uses active replication to detect and recover faulty replicas, and provide timeliness that requires delayed setpoints be masked from the PAs. We study the effect of delay faults and the impact of fault-tolerance with Axo, by deploying Axo in two real-world CPSs. Then, we realize that the proposed reliability mechanisms also apply to unconventional CPSs such as software defined networking (SDN), where the controlled process is the routing fabric of the network. We show that, in SDN, violating consistency can cause implementation of incorrect routing policies. Thus, we use Quarts and intentionality clocks, to design and implement QCL, a coordination layer for SDN controllers that guarantees control-plane consistency. QCL also drastically reduces the response time of SDN controllers when compared to consensus-based techniques. In the last part of the thesis, we address the problem of reliable communication between the software agents, in a wide-area network that can drop, delay or reorder messages. For this, we propose iPRP, an IP-friendly parallel redundancy protocol for 0 ms repair of packet losses. iPRP requires fail-independent paths for high-reliability. So, we study the fail-independence of Wi-Fi links using real-life measurements, as a first step towards using Wi-Fi for real-time communication in CPSs

Infoscience - École polytechnique fédérale de Lausanne

Tamper-Resistant Peer-to-Peer Storage for File Integrity Checking.

Author: Zangerl Alexander
Publication venue
Publication date: 01/01/2006
Field of study

“... oba es gibt kan Kompromiß, zwischen ehrlich sein und link, a wann’s no so afoch ausschaut, und wann’s noch so üblich is...” — Wolfgang Ambros, 1975 One of the activities of most successful intruders of a computer system is to modify data on the victim, either to hide his/her presence and to destroy the evidence of the break-in, or to subvert the system completely and make it accessible for further abuse without triggering alarms. File integrity checking is one common method to mitigate the effects of successful intrusions by detecting the changes an intruder makes to files on a computer system. Historically file integrity checking has been implemented using tools that operate locally on a single system, which imposes quite some restrictions regarding maintenance and scalability. Recent improvements for large scale environments have introduced trusted central servers which provide secure fingerprint storage and logging facilities, but such centralism presents some new shortcomings

Bond University Research Portal

CiteSeerX

Fault injection testing of software implemented fault tolerance mechanisms of distributed systems

Author: Tao Sha
Publication venue: Newcastle University
Publication date: 01/01/1996
Field of study

PhD ThesisOne way of gaining confidence in the adequacy of fault tolerance mechanisms of a system is to test the system by injecting faults and see how the system performs under faulty conditions. This thesis investigates the issues of testing software-implemented fault tolerance mechanisms of distributed systems through fault injection. A fault injection method has been developed. The method requires that the target software system be structured as a collection of objects interacting via messages. This enables easy insertion of fault injection objects into the target system to emulate incorrect behaviour of faulty processors by manipulating messages. This approach allows one to inject specific classes of faults while not requiring any significant changes to the target system. The method differs from the previous work in that it exploits an object oriented approach of software implementation to support the injection of specific classes of faults at the system level. The proposed fault injection method has been applied to test software-implemented reliable node systems: a TMR (triple modular redundant) node and a fail-silent node. The nodes have integrated fault tolerance mechanisms and are expected to exhibit certain behaviour in the presence of a failure. The thesis describes how various such mechanisms (for example, clock synchronisation protocol, and atomic broadcast protocol) were tested. The testing revealed flaws in implementation that had not been discovered before, thereby demonstrating the usefulness of the method. Application of the approach to other distributed systems is also described in the thesis.CEC ESPRIT programme, UK Engineering and Physical Sciences Research Council (EPSRC)

Newcastle University eTheses

Recommended from our members

A new approach to detecting failures in distributed systems

Author: Leners Joshua Blaise
Publication venue
Publication date: 18/09/2015
Field of study

textFault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that process has failed. Second, choosing a timeout is hard: short timeouts can exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value—one that balances the choice between accuracy and speed—may not even exist, owing to the variance in a system’s end-to-end delays. ƃis dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS’s process table. We call such information inside information, and use it as the basis in the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed by sometimes terminating unresponsive components. ƃis choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, and reports inside information to applications directly and with more detail to help applications make better recovery decisions. By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude over the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.Computer Science

Texas ScholarWorks

Tolerância a falhas em sistemas de comunicação de tempo-real flexíveis

Author: Ferreira Joaquim José de Castro
Publication venue: Universidade de Aveiro
Publication date: 01/01/2005
Field of study

Nas últimas décadas, os sistemas embutidos distribuídos, têm sido usados em variados domínios de aplicação, desde o controlo de processos industriais até ao controlo de aviões e automóveis, sendo expectável que esta tendência se mantenha e até se intensifique durante os próximos anos. Os requisitos de confiabilidade de algumas destas aplicações são extremamente importantes, visto que o não cumprimento de serviços de uma forma previsível e pontual pode causar graves danos económicos ou até pôr em risco vidas humanas. A adopção das melhores práticas de projecto no desenvolvimento destes sistemas não elimina, por si só, a ocorrência de falhas causadas pelo comportamento não determinístico do ambiente onde o sistema embutido distribuído operará. Desta forma, é necessário incluir mecanismos de tolerância a falhas que impeçam que eventuais falhas possam comprometer todo o sistema. Contudo, para serem eficazes, os mecanismos de tolerância a falhas necessitam ter conhecimento a priori do comportamento correcto do sistema de modo a poderem ser capazes de distinguir os modos correctos de funcionamento dos incorrectos. Tradicionalmente, quando se projectam mecanismos de tolerância a falhas, o conhecimento a priori significa que todos os possíveis modos de funcionamento são conhecidos na fase de projecto, não os podendo adaptar nem fazer evoluir durante a operação do sistema. Como consequência, os sistemas projectados de acordo com este princípio ou são completamente estáticos ou permitem apenas um pequeno número de modos de operação. Contudo, é desejável que os sistemas disponham de alguma flexibilidade de modo a suportarem a evolução dos requisitos durante a fase de operação, simplificar a manutenção e reparação, bem como melhorar a eficiência usando apenas os recursos do sistema que são efectivamente necessários em cada instante. Além disto, esta eficiência pode ter um impacto positivo no custo do sistema, em virtude deste poder disponibilizar mais funcionalidades com o mesmo custo ou a mesma funcionalidade a um menor custo. Porém, flexibilidade e confiabilidade têm sido encarados como conceitos conflituais. Isto deve-se ao facto de flexibilidade implicar a capacidade de permitir a evolução dos requisitos que, por sua vez, podem levar a cenários de operação imprevisíveis e possivelmente inseguros. Desta fora, é comummente aceite que apenas um sistema completamente estático pode ser tornado confiável, o que significa que todos os aspectos operacionais têm de ser completamente definidos durante a fase de projecto. Num sentido lato, esta constatação é verdadeira. Contudo, se os modos como o sistema se adapta a requisitos evolutivos puderem ser restringidos e controlados, então talvez seja possível garantir a confiabilidade permanente apesar das alterações aos requisitos durante a fase de operação. A tese suportada por esta dissertação defende que é possível flexibilizar um sistema, dentro de limites bem definidos, sem comprometer a sua confiabilidade e propõe alguns mecanismos que permitem a construção de sistemas de segurança crítica baseados no protocolo Controller Area Network (CAN). Mais concretamente, o foco principal deste trabalho incide sobre o protocolo Flexible Time-Triggered CAN (FTT-CAN), que foi especialmente desenvolvido para disponibilizar um grande nível de flexibilidade operacional combinando, não só as vantagens dos paradigmas de transmissão de mensagens baseados em eventos e em tempo, mas também a flexibilidade associada ao escalonamento dinâmico do tráfego cuja transmissão é despoletada apenas pela evolução do tempo. Este facto condiciona e torna mais complexo o desenvolvimento de mecanismos de tolerância a falhas para FTT-CAN do que para outros protocolos como por exemplo, TTCAN ou FlexRay, nos quais existe um conhecimento estático, antecipado e comum a todos os nodos, do escalonamento de mensagens cuja transmissão é despoletada pela evolução do tempo. Contudo, e apesar desta complexidade adicional, este trabalho demonstra que é possível construir mecanismos de tolerância a falhas para FTT-CAN preservando a sua flexibilidade operacional. É também defendido nesta dissertação que um sistema baseado no protocolo FTT-CAN e equipado com os mecanismos de tolerância a falhas propostos é passível de ser usado em aplicações de segurança crítica. Esta afirmação é suportada, no âmbito do protocolo FTT-CAN, através da definição de uma arquitectura tolerante a falhas integrando nodos com modos de falha tipo falha-silêncio e nodos mestre replicados. Os vários problemas resultantes da replicação dos nodos mestre são, também eles, analisados e várias soluções são propostas para os obviar. Concretamente, é proposto um protocolo que garante a consistência das estruturas de dados replicadas a quando da sua actualização e um outro protocolo que permite a transferência dessas estruturas de dados para um nodo mestre que se encontre não sincronizado com os restantes depois de inicializado ou reinicializado de modo assíncrono. Além disto, esta dissertação também discute o projecto de nodos FTT-CAN que exibam um modo de falha do tipo falha-silêncio e propõe duas soluções baseadas em componentes de hardware localizados no interface de rede de cada nodo, para resolver este problema. Uma das soluções propostas baseiase em bus guardians que permitem a imposição de comportamento falhasilêncio nos nodos escravos e suportam o escalonamento dinâmico de tráfego na rede. A outra solução baseia-se num interface de rede que arbitra o acesso de dois microprocessadores ao barramento. Este interface permite que a replicação interna de um nodo seja efectuada de forma transparente e assegura um comportamento falha-silêncio quer no domínio temporal quer no domínio do valor ao permitir transmissões do nodo apenas quando ambas as réplicas coincidam no conteúdo das mensagens e nos instantes de transmissão. Esta última solução está mais adaptada para ser usada nos nodos mestre, contudo também poderá ser usada nos nodos escravo, sempre que tal se revele fundamental.Distributed embedded systems (DES) have been widely used in the last few decades in several application fields, ranging from industrial process control to avionics and automotive systems. In fact, it is expectable that this trend will continue over the years to come. In some of these application domains the dependability requirements are of utmost importance since failing to provide services in a timely and predictable manner may cause important economic losses or even put human life in risk. The adoption of the best practices in the design of distributed embedded systems does not fully avoid the occurrence of faults, arising from the nondeterministic behavior of the environment where each particular DES operates. Thus, fault-tolerance mechanisms need to be included in the DES to prevent possible faults leading to system failure. To be effective, fault-tolerance mechanisms require an a priori knowledge of the correct system behavior to be capable of distinguishing them from the erroneous ones. Traditionally, when designing fault-tolerance mechanisms, the a priori knowledge means that all possible operational modes are known at system design time and cannot adapt nor evolve during runtime. As a consequence, systems designed according to this principle are either fully static or allow a small number of operational modes only. Flexibility, however, is a desired property in a system in order to support evolving requirements, simplify maintenance and repair, and improve the efficiency in using system resources by using only the resources that are effectively required at each instant. This efficiency might impact positively on the system cost because with the same resources one can add more functionality or one can offer the same functionality with fewer resources. However, flexibility and dependability are often regarded as conflicting concepts. This is so because flexibility implies the ability to deal with evolving requirements that, in turn, can lead to unpredictable and possibly unsafe operating scenarios. Therefore, it is commonly accepted that only a fully static system can be made dependable, meaning that all operating conditions are completely defined at pre-runtime. In the broad sense and assuming unbounded flexibility this assessment is true, but if one restricts and controls the ways the system could adapt to evolving requirements, then it might be possible to enforce continuous dependability. This thesis claims that it is possible to provide a bounded degree of flexibility without compromising dependability and proposes some mechanisms to build safety-critical systems based on the Controller Area Network (CAN). In particular, the main focus of this work is the Flexible Time-Triggered CAN protocol (FTT-CAN), which was specifically developed to provide such high level of operational flexibility, not only combining the advantages of time- and event-triggered paradigms but also providing flexibility to the time-triggered traffic. This fact makes the development of fault-tolerant mechanisms more complex in FTT-CAN than in other protocols, such as TTCAN or FlexRay, in which there is a priori static common knowledge of the time-triggered message schedule shared by all nodes. Nevertheless, as it is demonstrated in this work, it is possible to build fault-tolerant mechanisms for FTT-CAN that preserve its high level of operational flexibility, particularly concerning the time-triggered traffic. With such mechanisms it is argued that FTT-CAN is suitable for safetycritical applications, too. This claim was validated in the scope of the FTT-CAN protocol by presenting a fault-tolerant system architecture with replicated masters and fail-silent nodes. The specific problems and mechanisms related with master replication, particularly a protocol to enforce consistency during updates of replicated data structures and another protocol to transfer these data structures to an unsynchronized node upon asynchronous startup or restart, are also addressed. Moreover, this thesis also discusses the implementations of fail-silence in FTTCAN nodes and proposes two solutions, both based on hardware components that are attached to the node network interface. One solution relies on bus guardians that allow enforcing fail-silence in the time domain. These bus guardians are adapted to support dynamic traffic scheduling and are fit for use in FTT-CAN slave nodes, only. The other solution relies on a special network interface, with duplicated microprocessor interface, that supports internal replication of the node, transparently. In this case, fail-silence can be assured both in the time and value domain since transmissions are carried out only if both internal nodes agree on the transmission instant and message contents. This solution is well adapted for use in the masters but it can also be used, if desired, in slave nodes

Repositório Institucional da Universidade de Aveiro