35 research outputs found
Optimal self-stabilizing mobile byzantine-tolerant regular register with bounded timestamps
This paper proposes the first implementation of a self-stabilizing regular register emulated by n servers that is tolerant to both Mobile Byzantine Agents and transient failures in a round-free synchronous model. Differently from existing Mobile Byzantine Tolerant register implementations, this paper considers a weaker model where: (i) the computation of the servers is decoupled from the movements of the Byzantine agents, i.e., movements may happen before, concurrently, or after the generation or the delivery of a message, and (ii) servers are not aware of their failure state i.e., they do not know if and when they have been corrupted by a Mobile Byzantine agent. The proposed protocol tolerates (i) any finite number of transient failures, and (ii) up to f Mobile Byzantine agents. In addition, our implementation uses bounded timestamps from the Z13 domain and it is optimal with respect to the number of servers needed to tolerate f Mobile Byzantine agents in the given model (i.e., n>6f when Δ=2δ, and n>8f when Δ=δ, where Δ represents the period at which the Byzantine agents move and δ is the upper bound on the communication latency)
Reliable Broadcast despite Mobile Byzantine Faults
We investigate the solvability of the Byzantine Reliable Broadcast and
Byzantine Broadcast Channel problems in distributed systems affected by Mobile
Byzantine Faults. We show that both problems are not solvable even in one of
the most constrained system models for mobile Byzantine faults defined so far.
By endowing processes with an additional local failure oracle, we provide a
solution to the Byzantine Broadcast Channel problem
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis
Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions
Tractable reliable communication in compromised networks
Reliable communication is a fundamental primitive in distributed systems prone to Byzantine (i.e. arbitrary, and possibly malicious) failures to guarantee the integrity, delivery, and authorship of the messages exchanged between processes. Its practical adoption strongly depends on the system assumptions. Several solutions have been proposed so far in the literature implementing such a primitive, but some lack in scalability and/or demand topological network conditions computationally hard to be verified.
This thesis aims to investigate and address some of the open problems and challenges implementing such a communication primitive. Specifically, we analyze how a reliable communication primitive can be implemented in 1) a static distributed system where a subset of processes is compromised, 2) a dynamic distributed system where part of the processes is Byzantine faulty, and 3) a static distributed system where every process can be compromised and recover.
We define several more efficient protocols and we characterize alternative network conditions guaranteeing their correctness
Forwarding fault detection in wireless community networks
Wireless community networks (WCN) are specially vulnerable to routing forwarding failures because of their intrinsic characteristics: use of inexpensive hardware that can be easily accessed; managed in a decentralized way, sometimes by non-expert administrators, and open to everyone; making it prone to hardware failures, misconfigurations and malicious attacks. To increase routing robustness in WCN, we propose a detection mechanism to detect faulty routers, so that the problem can be tackled. Forwarding fault detection can be explained as a 4 steps process: first, there is the need of monitoring and summarizing the traffic observed; then, the traffic summaries are shared among peers, so that evaluation of a router's behavior can be done by analyzing all the relevant traffic summaries; finally, once the faulty nodes have been detected a response mechanism is triggered to solve the issue. The contributions of this thesis focus on the first three steps of this process, providing solutions adapted to Wireless Community Networks that can be deployed without the need of modifying its current network stack. First, we study and characterize the distribution of the error of sketches, a traffic summary function that is resilient to packet dropping, modification and creation and provides better estimations than sampling. We define a random process to describe the estimation for each sketch type, which allows us to provide tighter bounds on the sketch accuracy and choose the size of the sketch more accurately for a set of given requirements on the estimation accuracy. Second, we propose KDet, a traffic summary dissemination and detection protocol that, unlike previous solutions, is resilient to collusion and false accusation without the need of knowing a packet's path. Finally, we consider the case of nodes with unsynchronized clocks and we propose a traffic validation mechanism based on sketches that is capable of discerning between faulty and non-faulty nodes even when the traffic summaries are misaligned, i.e. they refer to slightly different intervals of time.Las redes comunitarias son especialmente vulnerables a errores en la retransmisión de paquetes de red, puesto que están formadas por equipos de gama baja, que pueden ser fácilmente accedidos por extraños; están gestionados de manera distribuida y no siempre por expertos, y además están abiertas a todo el mundo; con lo que de manera habitual presentan errores de hardware o configuración y son sensibles a ataques maliciosos. Para mejorar la robustez en el enrutamiento en estas redes, proponemos el uso de un mecanismo de detección de routers defectuosos, para asà poder corregir el problema. La detección de fallos de enrutamiento se puede explicar como un proceso de 4 pasos: el primero es monitorizar el tráfico existente, manteniendo desde cada punto de observación un resumen sobre el tráfico observado; después, estos resumenes se comparten entre los diferentes nodos, para que podamos llevar a cabo el siguiente paso: la evaluación del comportamiento de cada nodo. Finalmente, una vez hemos detectado los nodos maliciosos o que fallan, debemos actuar con un mecanismo de respuesta que corrija el problema. Esta tesis se concentra en los tres primeros pasos, y proponemos una solución para cada uno de ellos que se adapta al contexto de las redes comunitarias, de tal manera que se puede desplegar en ellas sin la necesidad de modificar los sistemas y protocolos de red ya existentes. Respecto a los resumenes de tráfico, presentamos un estudio y caracterización de la distribución de error de los sketches, una estructura de datos que es capaz de resumir flujos de tráfico resistente a la pérdida, manipulación y creación de paquetes y que además tiene mejor resolución que el muestreo. Para cada tipo de sketch, definimos una función de distribución que caracteriza el error cometido, de esta manera somos capaces de determinar con más precisión el tamaño del sketch requerido bajo unos requisitos de falsos positivos y negativos. Después proponemos KDet, un protocolo de diseminación de resumenes de tráfico y detección de nodos erróneos que, a diferencia de protocolos propuestos anteriormente, no require conocer el camino de cada paquete y es resistente a la confabulación de nodos maliciosos. Por último, consideramos el caso de nodos con relojes desincronizados, y proponemos un mecanismo de detección basado en sketches, capaz de discernir entre los nodos erróneos y correctos, aún a pesar del desalineamiento de los sketches (es decir, a pesar del que estos se refieran a momentos de tiempo ligeramente diferentes)
Fault-Tolerant Distributed Services in Message-Passing Systems
Distributed systems ranging from small local area networks to large wide area networks like the Internet composed of static and/or mobile users have become increasingly popular. A desirable property for any distributed service is fault-tolerance, which means the service remains uninterrupted even if some components in the network fail. This dissertation considers weak distributed models to find either algorithms to solve certain problems or impossibility proofs to show that a problem is unsolvable. These are the main contributions of this dissertation:
• Failure detectors are used as a service to solve consensus (agreement among nodes) which is otherwise impossible in failure-prone asynchronous systems. We find an algorithm for crash-failure detection that uses bounded size messages in an arbitrary, partitionable network composed of badly- behaved channels that can lose and reorder messages.
• Registers are a fundamental building block for shared memory emulations on top of message passing systems. The problem has been extensively studied in static systems. However, register emulation in dynamic systems with faulty nodes is still quite hard and there are impossibility proofs that point out scenarios where change in the system composition due to nodes entering and leaving (also called churn) makes the problem unsolvable.
We propose the first emulation of a crash-fault tolerant register in a system with continuous churn where consensus is unsolvable, the size of the system can grow without bound and at most a constant fraction of the number of nodes in the system can fail by crashing. We prove a lower bound that states that fault-tolerance for dynamic systems with churn is inherently lower than in static systems.
• We then extend the results in the crash-fault tolerant case to a dynamic system with continuous churn and nodes that can be Byzantine faulty. It is the first emulation of an atomic register in a system that can withstand nodes continually entering and leaving, imposes no upper bound on the system size and can tolerate Byzantine nodes. However, the number of Byzantine faulty nodes that can be tolerated is upper bounded by a constant number. Although the algorithm requires that there be a constant known upper bound on the number of Byzantine nodes, this restriction is unavoidable, as we show that it is impossible to emulate an atomic register if the system size and maximum number of servers that can be Byzantine in the system is unknown
A Critique of the CAP Theorem
The CAP Theorem is a frequently cited impossibility result in distributed
systems, especially among NoSQL distributed databases. In this paper we survey
some of the confusion about the meaning of CAP, including inconsistencies and
ambiguities in its definitions, and we highlight some problems in its
formalization. CAP is often interpreted as proof that eventually consistent
databases have better availability properties than strongly consistent
databases; although there is some truth in this, we show that more careful
reasoning is required. These problems cast doubt on the utility of CAP as a
tool for reasoning about trade-offs in practical systems. As alternative to
CAP, we propose a "delay-sensitivity" framework, which analyzes the sensitivity
of operation latency to network delay, and which may help practitioners reason
about the trade-offs between consistency guarantees and tolerance of network
faults
Tolerância a falhas em sistemas de comunicação de tempo-real flexÃveis
Nas últimas décadas, os sistemas embutidos distribuÃdos, têm sido usados em
variados domÃnios de aplicação, desde o controlo de processos industriais até
ao controlo de aviões e automóveis, sendo expectável que esta tendência se
mantenha e até se intensifique durante os próximos anos.
Os requisitos de confiabilidade de algumas destas aplicações são
extremamente importantes, visto que o não cumprimento de serviços de uma
forma previsÃvel e pontual pode causar graves danos económicos ou até pôr
em risco vidas humanas.
A adopção das melhores práticas de projecto no desenvolvimento destes
sistemas não elimina, por si só, a ocorrência de falhas causadas pelo
comportamento não determinÃstico do ambiente onde o sistema embutido
distribuÃdo operará. Desta forma, é necessário incluir mecanismos de
tolerância a falhas que impeçam que eventuais falhas possam comprometer
todo o sistema.
Contudo, para serem eficazes, os mecanismos de tolerância a falhas
necessitam ter conhecimento a priori do comportamento correcto do sistema
de modo a poderem ser capazes de distinguir os modos correctos de
funcionamento dos incorrectos.
Tradicionalmente, quando se projectam mecanismos de tolerância a falhas, o
conhecimento a priori significa que todos os possÃveis modos de
funcionamento são conhecidos na fase de projecto, não os podendo adaptar
nem fazer evoluir durante a operação do sistema. Como consequência, os
sistemas projectados de acordo com este princÃpio ou são completamente
estáticos ou permitem apenas um pequeno número de modos de operação.
Contudo, é desejável que os sistemas disponham de alguma flexibilidade de
modo a suportarem a evolução dos requisitos durante a fase de operação,
simplificar a manutenção e reparação, bem como melhorar a eficiência usando
apenas os recursos do sistema que são efectivamente necessários em cada
instante. Além disto, esta eficiência pode ter um impacto positivo no custo do
sistema, em virtude deste poder disponibilizar mais funcionalidades com o
mesmo custo ou a mesma funcionalidade a um menor custo.
Porém, flexibilidade e confiabilidade têm sido encarados como conceitos
conflituais.
Isto deve-se ao facto de flexibilidade implicar a capacidade de permitir a
evolução dos requisitos que, por sua vez, podem levar a cenários de operação
imprevisÃveis e possivelmente inseguros. Desta fora, é comummente aceite
que apenas um sistema completamente estático pode ser tornado confiável, o
que significa que todos os aspectos operacionais têm de ser completamente
definidos durante a fase de projecto.
Num sentido lato, esta constatação é verdadeira. Contudo, se os modos como
o sistema se adapta a requisitos evolutivos puderem ser restringidos e
controlados, então talvez seja possÃvel garantir a confiabilidade permanente
apesar das alterações aos requisitos durante a fase de operação.
A tese suportada por esta dissertação defende que é possÃvel flexibilizar um
sistema, dentro de limites bem definidos, sem comprometer a sua
confiabilidade e propõe alguns mecanismos que permitem a construção de
sistemas de segurança crÃtica baseados no protocolo Controller Area Network
(CAN). Mais concretamente, o foco principal deste trabalho incide sobre o
protocolo Flexible Time-Triggered CAN (FTT-CAN), que foi especialmente
desenvolvido para disponibilizar um grande nÃvel de flexibilidade operacional
combinando, não só as vantagens dos paradigmas de transmissão de
mensagens baseados em eventos e em tempo, mas também a flexibilidade
associada ao escalonamento dinâmico do tráfego cuja transmissão é
despoletada apenas pela evolução do tempo.
Este facto condiciona e torna mais complexo o desenvolvimento de
mecanismos de tolerância a falhas para FTT-CAN do que para outros
protocolos como por exemplo, TTCAN ou FlexRay, nos quais existe um
conhecimento estático, antecipado e comum a todos os nodos, do
escalonamento de mensagens cuja transmissão é despoletada pela evolução
do tempo.
Contudo, e apesar desta complexidade adicional, este trabalho demonstra que
é possÃvel construir mecanismos de tolerância a falhas para FTT-CAN
preservando a sua flexibilidade operacional.
É também defendido nesta dissertação que um sistema baseado no protocolo
FTT-CAN e equipado com os mecanismos de tolerância a falhas propostos é
passÃvel de ser usado em aplicações de segurança crÃtica.
Esta afirmação é suportada, no âmbito do protocolo FTT-CAN, através da
definição de uma arquitectura tolerante a falhas integrando nodos com modos
de falha tipo falha-silêncio e nodos mestre replicados.
Os vários problemas resultantes da replicação dos nodos mestre são, também
eles, analisados e várias soluções são propostas para os obviar.
Concretamente, é proposto um protocolo que garante a consistência das
estruturas de dados replicadas a quando da sua actualização e um outro
protocolo que permite a transferência dessas estruturas de dados para um
nodo mestre que se encontre não sincronizado com os restantes depois de
inicializado ou reinicializado de modo assÃncrono.
Além disto, esta dissertação também discute o projecto de nodos FTT-CAN
que exibam um modo de falha do tipo falha-silêncio e propõe duas soluções
baseadas em componentes de hardware localizados no interface de rede de
cada nodo, para resolver este problema. Uma das soluções propostas baseiase
em bus guardians que permitem a imposição de comportamento falhasilêncio
nos nodos escravos e suportam o escalonamento dinâmico de tráfego
na rede. A outra solução baseia-se num interface de rede que arbitra o acesso
de dois microprocessadores ao barramento. Este interface permite que a
replicação interna de um nodo seja efectuada de forma transparente e
assegura um comportamento falha-silêncio quer no domÃnio temporal quer no
domÃnio do valor ao permitir transmissões do nodo apenas quando ambas as
réplicas coincidam no conteúdo das mensagens e nos instantes de
transmissão. Esta última solução está mais adaptada para ser usada nos
nodos mestre, contudo também poderá ser usada nos nodos escravo, sempre
que tal se revele fundamental.Distributed embedded systems (DES) have been widely used in the last few
decades in several application fields, ranging from industrial process control to
avionics and automotive systems. In fact, it is expectable that this trend will
continue over the years to come.
In some of these application domains the dependability requirements are of
utmost importance since failing to provide services in a timely and predictable
manner may cause important economic losses or even put human life in risk.
The adoption of the best practices in the design of distributed embedded
systems does not fully avoid the occurrence of faults, arising from the nondeterministic
behavior of the environment where each particular DES operates.
Thus, fault-tolerance mechanisms need to be included in the DES to prevent
possible faults leading to system failure.
To be effective, fault-tolerance mechanisms require an a priori knowledge of
the correct system behavior to be capable of distinguishing them from the
erroneous ones.
Traditionally, when designing fault-tolerance mechanisms, the a priori
knowledge means that all possible operational modes are known at system
design time and cannot adapt nor evolve during runtime. As a consequence,
systems designed according to this principle are either fully static or allow a
small number of operational modes only. Flexibility, however, is a desired
property in a system in order to support evolving requirements, simplify
maintenance and repair, and improve the efficiency in using system resources
by using only the resources that are effectively required at each instant. This
efficiency might impact positively on the system cost because with the same
resources one can add more functionality or one can offer the same
functionality with fewer resources.
However, flexibility and dependability are often regarded as conflicting
concepts. This is so because flexibility implies the ability to deal with evolving
requirements that, in turn, can lead to unpredictable and possibly unsafe
operating scenarios. Therefore, it is commonly accepted that only a fully static
system can be made dependable, meaning that all operating conditions are
completely defined at pre-runtime.
In the broad sense and assuming unbounded flexibility this assessment is true,
but if one restricts and controls the ways the system could adapt to evolving
requirements, then it might be possible to enforce continuous dependability.
This thesis claims that it is possible to provide a bounded degree of flexibility
without compromising dependability and proposes some mechanisms to build
safety-critical systems based on the Controller Area Network (CAN).
In particular, the main focus of this work is the Flexible Time-Triggered CAN
protocol (FTT-CAN), which was specifically developed to provide such high
level of operational flexibility, not only combining the advantages of time- and
event-triggered paradigms but also providing flexibility to the time-triggered
traffic. This fact makes the development of fault-tolerant mechanisms more
complex in FTT-CAN than in other protocols, such as TTCAN or FlexRay, in
which there is a priori static common knowledge of the time-triggered message
schedule shared by all nodes. Nevertheless, as it is demonstrated in this work,
it is possible to build fault-tolerant mechanisms for FTT-CAN that preserve its
high level of operational flexibility, particularly concerning the time-triggered
traffic. With such mechanisms it is argued that FTT-CAN is suitable for safetycritical
applications, too.
This claim was validated in the scope of the FTT-CAN protocol by presenting a
fault-tolerant system architecture with replicated masters and fail-silent nodes.
The specific problems and mechanisms related with master replication,
particularly a protocol to enforce consistency during updates of replicated data
structures and another protocol to transfer these data structures to an
unsynchronized node upon asynchronous startup or restart, are also
addressed.
Moreover, this thesis also discusses the implementations of fail-silence in FTTCAN
nodes and proposes two solutions, both based on hardware components
that are attached to the node network interface. One solution relies on bus
guardians that allow enforcing fail-silence in the time domain. These bus
guardians are adapted to support dynamic traffic scheduling and are fit for use
in FTT-CAN slave nodes, only. The other solution relies on a special network
interface, with duplicated microprocessor interface, that supports internal
replication of the node, transparently. In this case, fail-silence can be assured
both in the time and value domain since transmissions are carried out only if
both internal nodes agree on the transmission instant and message contents.
This solution is well adapted for use in the masters but it can also be used, if
desired, in slave nodes