8 research outputs found

    Counter Attack on Byzantine Generals: Parameterized Model Checking of Fault-tolerant Distributed Algorithms

    We introduce an automated parameterized verification method for fault-tolerant distributed algorithms (FTDAs). FTDAs are parameterized by both the number of processes and the assumed maximum number of Byzantine faulty processes. At the center of our technique is a parametric interval abstraction (PIA) in which the interval boundaries are arithmetic expressions over the parameters. Using PIA for both data abstraction and a new form of counter abstraction, we reduce the parameterized problem to finite-state model checking. We demonstrate the practical feasibility of our method by verifying several variants of the well-known distributed algorithm by Srikanth and Toueg. Our semi-decision procedures are complemented and motivated by an undecidability proof for FTDA verification, which holds even in the absence of interprocess communication. To the best of our knowledge, this is the first paper to achieve parameterized automated verification of Byzantine FTDAs.

    The Complexity of Asynchronous Byzantine Consensus

    This paper establishes the first theorem relating resilience, round complexity and authentication in distributed computing. We give an exact measure of the time complexity of consensus algorithms that tolerate Byzantine failures and arbitrarily long periods of asynchrony, as in the Internet. The measure expresses the ability of processes to reach a consensus decision in a minimal number of rounds of information exchange, as a function of (a) the ability to use authentication and (b) the number of actual process failures in those rounds, as well as of (c) the total number of failures tolerated and (d) the system configuration. The measure holds for a framework in which the different roles of processes are distinguished, so that we can directly derive a meaningful bound on the time complexity of implementing robust general services in practical distributed systems. To prove our theorem, we establish certain lower bounds and give algorithms that match these bounds. The algorithms are all variants of the same generic asynchronous Byzantine consensus algorithm, which is interesting in its own right.

    Consensus with byzantine failures and little system synchrony

    We study consensus in a message-passing system where only some of the links exhibit some synchrony. This problem was previously studied for systems with process crashes; we now consider byzantine failures. We show that consensus can be solved in a system where there is at least one non-faulty process whose links are eventually timely; all other links can be arbitrarily slow. We also show that, in terms of problem solvability, such a system is strictly weaker than one where all links are eventually timely.

    Abstractions for Solving Consensus and Related Problems with Byzantine Faults

    We are becoming increasingly dependent on online services; therefore, their availability and correct behavior become increasingly important. Software replication is a popular technique for ensuring that computer systems continue to provide a correct service even when some of their components fail. By replicating a service on multiple servers, clients are guaranteed that even if some replica fails, the service is still available. At the core of software replication is the consensus problem, where a set of processes has to agree on a single value. A large number of consensus algorithms for different system models have been proposed. The most general system models (for which consensus is solvable) do not make strong assumptions on synchrony (they allow periods of asynchrony) and assume that a subset of processes can fail completely arbitrarily (Byzantine faults). However, solving consensus in the presence of arbitrary faults and asynchrony is hard and demands sophisticated algorithms. Most of the existing consensus algorithms that deal with arbitrary faults are monolithic and developed from scratch, or by modifying existing algorithms in a non-modular manner. As a consequence, these algorithms are rather complex and hard to understand. We attribute this complexity to the lack of adequate abstractions. The motivation of this thesis is to suggest abstractions that simplify the understanding of existing consensus algorithms with arbitrary faults and allow modular design of novel algorithms. The thesis also aims to clarify the relations between consensus and the total-order broadcast problem in the presence of arbitrary faults. In the context of the consensus problem with arbitrary process faults, the literature distinguishes (1) authenticated Byzantine faults, where messages can be signed by the sending process, and (2) Byzantine faults, where there is no mechanism for signatures.
Consensus protocols that assume Byzantine faults (without authentication) are harder to develop and prove correct than algorithms that consider authenticated Byzantine faults, even when they are based on the same idea. We propose an abstraction called weak interactive consistency (WIC) that allows us to design consensus algorithms that can be instantiated into algorithms for authenticated Byzantine faults (signed messages) and algorithms for Byzantine faults. In other words, WIC unifies Byzantine consensus algorithms with and without signatures. This is illustrated on two seminal Byzantine consensus algorithms: the Castro-Liskov PBFT algorithm (no signatures) and the Martin-Alvisi FaB Paxos algorithm (signatures). WIC allows a very concise expression of these two algorithms. Furthermore, WIC turns out to be a fundamental abstraction for solving consensus in the transmission fault model. The transmission fault model captures faults without blaming a specific component for the fault, and it is well adapted to dynamic and transient faults. Using WIC, we designed a consensus algorithm that overcomes the limitations of all existing solutions to consensus in this model, which either assume the synchronous system model or require strong conditions for termination that exclude the case where all messages of a process can be corrupted. We then go one step further in unifying consensus algorithms by proposing a generic consensus algorithm that highlights, through well-chosen parameters, the core mechanisms of a number of well-known consensus algorithms including Paxos, OneThirdRule, PBFT and FaB Paxos. Interestingly, the generic algorithm allows us to identify a new Byzantine consensus algorithm that requires n > 4b, in between the requirement n > 5b of FaB Paxos and n > 3b of PBFT (b is the maximum number of Byzantine processes). Afterwards, we study the relation between consensus and total-order broadcast in the presence of Byzantine faults.
Total-order broadcast is defined for a set of processes, where each process can broadcast messages, with the guarantee that all processes in this set see the same sequence of messages. Among the several definitions of Byzantine consensus that differ only in their validity property, we identify those equivalent to total-order broadcast. We also give the first deterministic reduction of total-order broadcast to consensus with constant time complexity with respect to consensus. Finally, we consider state-machine replication (SMR) with Byzantine faults. State-machine replication is a general approach for replicating services that can be modeled as a state machine. The key idea of this approach is to guarantee that all replicas start in the same state and then apply requests from clients in the same order, thereby guaranteeing that the replica states do not diverge. Recent studies have shown that most BFT-SMR algorithms do not actually perform well under performance attacks by Byzantine processes. We propose a new BFT-SMR algorithm, called BFT-Mencius, that guarantees, assuming a partially synchronous system model, that the latency of updates of correct processes is eventually upper-bounded, even under performance attacks by Byzantine processes. BFT-Mencius is a modular, signature-free algorithm based on a new communication primitive called Abortable Timely Announced Broadcast (ATAB). We evaluate the performance of BFT-Mencius in cluster settings and show that it performs comparably to state-of-the-art algorithms such as PBFT and Spinning in fault-free configurations and outperforms these algorithms under performance attacks by Byzantine processes.

    Fault tolerance in flexible real-time communication systems (Tolerância a falhas em sistemas de comunicação de tempo-real flexíveis)

    In recent decades, distributed embedded systems have been used in a wide range of application domains, from industrial process control to the control of aircraft and automobiles, and this trend is expected to continue and even intensify over the coming years. The dependability requirements of some of these applications are extremely important, since failing to deliver services in a predictable and timely manner can cause serious economic damage or even put human lives at risk. Adopting best design practices in the development of these systems does not, by itself, eliminate the occurrence of faults caused by the non-deterministic behaviour of the environment in which the distributed embedded system operates. It is therefore necessary to include fault-tolerance mechanisms that prevent any fault from compromising the whole system. However, to be effective, fault-tolerance mechanisms need a priori knowledge of the correct behaviour of the system so that they can distinguish correct operating modes from incorrect ones. Traditionally, when fault-tolerance mechanisms are designed, a priori knowledge means that all possible operating modes are known at design time and cannot be adapted or evolved while the system is in operation. As a consequence, systems designed according to this principle are either completely static or allow only a small number of operating modes. However, it is desirable for systems to have some flexibility in order to support the evolution of requirements during the operational phase, to simplify maintenance and repair, and to improve efficiency by using only the system resources that are actually needed at each instant.
Moreover, this efficiency can have a positive impact on system cost, since the system can offer more functionality at the same cost or the same functionality at a lower cost. However, flexibility and dependability have been regarded as conflicting concepts. This is because flexibility implies the ability to let requirements evolve, which in turn can lead to unpredictable and possibly unsafe operating scenarios. Thus, it is commonly accepted that only a completely static system can be made dependable, meaning that all operational aspects have to be fully defined during the design phase. In a broad sense, this observation is true. However, if the ways in which the system adapts to evolving requirements can be restricted and controlled, then it may be possible to guarantee continuous dependability despite changes to the requirements during operation. The thesis supported by this dissertation argues that it is possible to make a system flexible, within well-defined limits, without compromising its dependability, and proposes some mechanisms that allow safety-critical systems to be built on the Controller Area Network (CAN) protocol. More specifically, the main focus of this work is the Flexible Time-Triggered CAN (FTT-CAN) protocol, which was specifically developed to provide a high level of operational flexibility by combining not only the advantages of the event-triggered and time-triggered message transmission paradigms, but also the flexibility associated with dynamic scheduling of traffic whose transmission is triggered solely by the passage of time.
This fact constrains and complicates the development of fault-tolerance mechanisms for FTT-CAN compared with other protocols such as TTCAN or FlexRay, in which there is static, a priori knowledge, common to all nodes, of the schedule of messages whose transmission is triggered by the passage of time. Nevertheless, despite this additional complexity, this work demonstrates that it is possible to build fault-tolerance mechanisms for FTT-CAN while preserving its operational flexibility. This dissertation also argues that a system based on the FTT-CAN protocol and equipped with the proposed fault-tolerance mechanisms can be used in safety-critical applications. This claim is supported, within the scope of the FTT-CAN protocol, by defining a fault-tolerant architecture that integrates nodes with a fail-silent failure mode and replicated master nodes. The various problems arising from master node replication are also analysed, and several solutions are proposed to overcome them. Specifically, a protocol is proposed that guarantees the consistency of replicated data structures when they are updated, and another protocol that allows those data structures to be transferred to a master node that is out of synchronization with the others after being initialized or reinitialized asynchronously. In addition, this dissertation discusses the design of FTT-CAN nodes that exhibit a fail-silent failure mode and proposes two solutions to this problem, both based on hardware components located in the network interface of each node. One of the proposed solutions is based on bus guardians that allow fail-silent behaviour to be enforced in slave nodes and support dynamic traffic scheduling on the network. The other solution is based on a network interface that arbitrates the access of two microprocessors to the bus.
This interface allows a node to be internally replicated transparently and ensures fail-silent behaviour in both the time domain and the value domain by allowing the node to transmit only when both replicas agree on the message contents and on the transmission instants. This latter solution is better suited for use in master nodes, but it may also be used in slave nodes whenever that proves essential.

Distributed embedded systems (DES) have been widely used in the last few decades in several application fields, ranging from industrial process control to avionics and automotive systems. In fact, it is expected that this trend will continue over the years to come. In some of these application domains the dependability requirements are of utmost importance, since failing to provide services in a timely and predictable manner may cause important economic losses or even put human lives at risk. The adoption of best practices in the design of distributed embedded systems does not fully avoid the occurrence of faults arising from the nondeterministic behavior of the environment in which each particular DES operates. Thus, fault-tolerance mechanisms need to be included in the DES to prevent possible faults from leading to system failure. To be effective, fault-tolerance mechanisms require a priori knowledge of the correct system behavior to be capable of distinguishing correct behaviors from erroneous ones. Traditionally, when designing fault-tolerance mechanisms, a priori knowledge means that all possible operational modes are known at system design time and cannot adapt or evolve during runtime. As a consequence, systems designed according to this principle are either fully static or allow only a small number of operational modes.
Flexibility, however, is a desired property in a system in order to support evolving requirements, simplify maintenance and repair, and improve the efficiency of system resource usage by using only the resources that are effectively required at each instant. This efficiency might have a positive impact on system cost, because with the same resources one can add more functionality, or one can offer the same functionality with fewer resources. However, flexibility and dependability are often regarded as conflicting concepts. This is so because flexibility implies the ability to deal with evolving requirements that, in turn, can lead to unpredictable and possibly unsafe operating scenarios. Therefore, it is commonly accepted that only a fully static system can be made dependable, meaning that all operating conditions are completely defined before runtime. In the broad sense, and assuming unbounded flexibility, this assessment is true; but if one restricts and controls the ways in which the system can adapt to evolving requirements, then it might be possible to enforce continuous dependability. This thesis claims that it is possible to provide a bounded degree of flexibility without compromising dependability, and proposes some mechanisms to build safety-critical systems based on the Controller Area Network (CAN). In particular, the main focus of this work is the Flexible Time-Triggered CAN protocol (FTT-CAN), which was specifically developed to provide such a high level of operational flexibility, not only combining the advantages of the time- and event-triggered paradigms but also providing flexibility to the time-triggered traffic. This fact makes the development of fault-tolerance mechanisms more complex in FTT-CAN than in other protocols, such as TTCAN or FlexRay, in which there is a priori static common knowledge of the time-triggered message schedule shared by all nodes.
Nevertheless, as demonstrated in this work, it is possible to build fault-tolerance mechanisms for FTT-CAN that preserve its high level of operational flexibility, particularly concerning the time-triggered traffic. With such mechanisms, it is argued that FTT-CAN is suitable for safety-critical applications, too. This claim was validated in the scope of the FTT-CAN protocol by presenting a fault-tolerant system architecture with replicated masters and fail-silent nodes. The specific problems and mechanisms related to master replication, particularly a protocol to enforce consistency during updates of replicated data structures and another protocol to transfer these data structures to an unsynchronized node upon asynchronous startup or restart, are also addressed. Moreover, this thesis discusses the implementation of fail-silence in FTT-CAN nodes and proposes two solutions, both based on hardware components attached to the node's network interface. One solution relies on bus guardians that allow enforcing fail-silence in the time domain. These bus guardians are adapted to support dynamic traffic scheduling and are fit for use in FTT-CAN slave nodes only. The other solution relies on a special network interface, with a duplicated microprocessor interface, that supports internal replication of the node transparently. In this case, fail-silence can be assured in both the time and value domains, since transmissions are carried out only if both internal nodes agree on the transmission instant and message contents. This solution is well adapted for use in the masters, but it can also be used, if desired, in slave nodes.