Programming with process groups: Group and multicast semantics
Process groups are a natural tool for distributed programming and are increasingly important in distributed computing environments. Discussed here is a new architecture that arose from an effort to simplify Isis process group semantics. The findings include a refined notion of how the clients of a group should be treated, what the properties of a multicast primitive should be when systems contain large numbers of overlapping groups, and a new construct called the causality domain. A system based on this architecture is now being implemented in collaboration with the Chorus and Mach projects.
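The multicast primitive at the heart of such group semantics is typically causally ordered: a message is delivered only after every message it causally depends on has been delivered. Below is a minimal sketch of that delivery rule using per-group vector clocks; it is not the Isis implementation, and the class and method names are invented for illustration.

```python
# Minimal causal-multicast delivery sketch (illustrative, not Isis).
# Each member keeps a vector clock counting messages delivered per sender.

class GroupMember:
    def __init__(self, member_id, group_size):
        self.member_id = member_id
        self.delivered = [0] * group_size   # messages delivered per sender
        self.pending = []                   # messages waiting on causal deps

    def send_stamp(self):
        # Stamp an outgoing multicast with this member's current view,
        # incremented in its own slot.
        stamp = list(self.delivered)
        stamp[self.member_id] += 1
        return stamp

    def receive(self, sender, stamp, payload):
        self.pending.append((sender, stamp, payload))
        return self._drain()

    def _deliverable(self, sender, stamp):
        # Deliverable iff it is the next message from this sender and all
        # messages it causally depends on have already been delivered.
        if stamp[sender] != self.delivered[sender] + 1:
            return False
        return all(stamp[i] <= self.delivered[i]
                   for i in range(len(stamp)) if i != sender)

    def _drain(self):
        delivered, progress = [], True
        while progress:
            progress = False
            for msg in list(self.pending):
                sender, stamp, payload = msg
                if self._deliverable(sender, stamp):
                    self.pending.remove(msg)
                    self.delivered[sender] += 1
                    delivered.append(payload)
                    progress = True
        return delivered
```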
A Fast Implementation of Parallel Snapshot Isolation
Undergraduate thesis, Degree in Computer Engineering (Grado en Ingeniería Informática), Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2019/2020.
Most distributed database systems offer weak consistency models in order to avoid the performance penalty of coordinating replicas. Ideally, distributed databases would offer strong consistency models, like serialisability, since they make it easy to verify application invariants and free programmers from worrying about concurrency. However, implementing and scaling systems with strong consistency is difficult, since it usually requires global communication. Weak models, while easier to scale, impose on programmers the need to reason about possible anomalies and to implement conflict resolution mechanisms in application code.
Recently proposed consistency models, like Parallel Snapshot Isolation (PSI) and Non-Monotonic Snapshot Isolation (NMSI), represent the strongest models that still allow building scalable systems without global communication. They offer performance and abort rates comparable to previous, weaker models. However, both models still provide weaker guarantees than serialisability and may prove difficult to use in applications.
This work presents an approach to bridge the gap between PSI, NMSI, and strong consistency models like serialisability. It introduces and implements fastPSI, a consistency protocol that allows the user to selectively enforce serialisability for certain executions while retaining the scalability properties of weaker consistency models like PSI and NMSI. In addition, it features a comprehensive evaluation of fastPSI against other consistency protocols, both weak and strong, showing that fastPSI offers better performance than serialisability while retaining the scalability of weaker protocols.
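The abstract does not give fastPSI's interface, but the idea of selectively enforcing serialisability can be pictured with a hypothetical client sketch like the one below; all names (Store, Txn, Level) are invented for illustration.

```python
# Hypothetical sketch of fastPSI-style selective consistency: most
# transactions run under PSI; those that must observe a serialisable
# execution opt in explicitly.

from enum import Enum

class Level(Enum):
    PSI = "psi"                     # default: no global coordination
    SERIALIZABLE = "serializable"   # opt-in: stricter validation, more aborts

class Store:
    def __init__(self):
        self.data = {}

    def transaction(self, level=Level.PSI):
        return Txn(self, level)

class Txn:
    def __init__(self, store, level):
        self.store, self.level = store, level
        self.writes = {}

    def read(self, key):
        # Under PSI, reads come from a causally consistent snapshot; under
        # SERIALIZABLE, the protocol would additionally validate the read
        # set at commit time.
        return self.writes.get(key, self.store.data.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Placeholder: a real protocol certifies against concurrent
        # transactions here, more strictly for SERIALIZABLE.
        self.store.data.update(self.writes)

# A transaction that must not observe PSI anomalies opts in:
store = Store()
txn = store.transaction(level=Level.SERIALIZABLE)
txn.write("balance", 100)
txn.commit()
```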
Rigorous Design of Distributed Transactions
Database replication is traditionally envisaged as a way of increasing fault tolerance and availability. It is advantageous to replicate data when the transaction workload is predominantly read-only. However, updating replicated data within a transactional framework is a complex affair due to failures and race conditions among conflicting transactions. This thesis investigates various mechanisms for the management of replicas in a large distributed system, formalizing and reasoning about the behavior of such systems using Event-B. We begin by studying current approaches for the management of replicated data and explore the use of broadcast primitives for processing transactions. Subsequently, we outline how a refinement-based approach can be used for the development of a reliable replicated database system that ensures atomic commitment of distributed transactions using ordered broadcasts. Event-B is a formal technique that consists of describing the problem rigorously in an abstract model, introducing solutions or design details in refinement steps to obtain more concrete specifications, and verifying that the proposed solutions are correct. This technique requires the discharge of proof obligations for consistency checking and refinement checking. The B tools provide significant automated proof support for generating the proof obligations and discharging them. The majority of the proof obligations are proved by the automatic prover of the tools; however, some complex proof obligations require interaction with the interactive prover. These proof obligations also help discover new system invariants. The proof obligations and the invariants help us to understand the complexity of the problem and the correctness of the solutions. They also provide a clear insight into the system and enhance our understanding of why a design decision should work. The objective of the research is to demonstrate a technique for the incremental construction of formal models of distributed systems and for reasoning about them, to develop a technique for discovering gluing invariants when the prover fails to discharge a proof obligation automatically, and to develop guidelines for the verification of distributed algorithms using abstraction and refinement.
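The broadcast-based commitment idea can be summarised in a few lines: if every replica receives transactions through a total-order broadcast, each replica can run the same deterministic certification test and reach the same commit/abort decision without a separate atomic commit protocol. A minimal sketch, with invented names and an abstract certification test (the thesis works with formal Event-B models, not this code):

```python
# Sketch of certification-based commit over total-order broadcast.
# The broadcast layer is abstracted away; replicas are assumed to
# process transactions in the same delivery order.

class Replica:
    def __init__(self):
        self.versions = {}   # key -> sequence number of last committed write

    def on_deliver(self, seqno, txn):
        """Called by the total-order broadcast layer, in delivery order."""
        # Certify: abort if any item read was overwritten after the
        # snapshot the transaction read from.
        for key, read_at in txn["read_set"].items():
            if self.versions.get(key, 0) > read_at:
                return "abort"
        for key in txn["write_set"]:
            self.versions[key] = seqno
        return "commit"
```

Because every replica sees the same delivery sequence and the test is deterministic, all replicas decide identically, which is what makes the ordered broadcast a substitute for a distributed atomic commit round.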
Multi-value distributed key-value stores
Doctoral thesis in Informatics.
Many large-scale distributed data stores rely on optimistic replication to scale and remain highly available in the face of network partitions. Managing data without strong coordination results in eventually consistent data stores that allow for concurrent data updates. To allow writing applications in the absence of linearizability or transactions, the seminal Dynamo data store proposed a multi-value API in which a get returns the set of concurrently written values. In this scenario, it is important to be able to accurately and efficiently identify updates executed concurrently. Logical clocks are often used to track data causality, which is necessary to distinguish concurrent from causally related writes on the same key. However, traditional mechanisms carry a non-negligible metadata overhead per key, which also keeps growing with time, proportionally to the node churn rate. Another challenge is deleting keys while respecting causality: while the values can be deleted, per-key metadata cannot be permanently removed in current data stores.
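As a concrete illustration of the causality tracking discussed above (a generic version-vector sketch, not the thesis's mechanism): with one logical clock entry per replica, two versions of a key are concurrent exactly when neither clock dominates the other.

```python
# Generic version-vector comparison: detects whether two versions of a
# key are causally ordered or concurrent (i.e., conflicting).

def dominates(a, b):
    """True if clock a has seen everything clock b has."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a-after-b"      # a causally supersedes b
    if dominates(b, a):
        return "b-after-a"
    return "concurrent"         # neither dominates: keep both values

# Writes accepted at different replicas without coordination:
print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))  # a-after-b
print(compare({"r1": 2}, {"r2": 3}))                    # concurrent
```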
These systems often use anti-entropy mechanisms (such as Merkle trees) to detect and repair divergent data versions across nodes. In practice, however, hash-based data structures are not well suited to stores that use consistent hashing, and they create too many false positives.
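For reference, the Merkle-tree approach being criticised works roughly as follows (a toy sketch over a fixed key range): nodes exchange hashes from the root down and recurse only into subtrees whose hashes differ, so locating divergent keys is cheap when few keys differ.

```python
# Toy Merkle-style range comparison: recurse into halves whose hashes
# differ in order to localise divergent keys.

import hashlib

def h(x):
    return hashlib.sha256(x.encode()).hexdigest()

def range_hash(store, keys):
    return h("".join(f"{k}={store.get(k, '')}" for k in keys))

def diff(store_a, store_b, keys):
    if range_hash(store_a, keys) == range_hash(store_b, keys):
        return []                     # identical subtree: prune
    if len(keys) == 1:
        return keys                   # divergent leaf found
    mid = len(keys) // 2
    return diff(store_a, store_b, keys[:mid]) + diff(store_a, store_b, keys[mid:])

keys = [f"k{i}" for i in range(8)]
a = {k: "v" for k in keys}
b = dict(a, k3="other")
print(diff(a, b, keys))  # ['k3']
```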
Also, highly available systems usually provide eventual consistency, which is the weakest form of consistency. This results in a programming model that is difficult to use and to reason about. It has been proved that causal consistency is the strongest consistency model achievable for highly available services, and it provides better programming semantics, such as session guarantees. However, classical causal consistency is a memory model that is problematic for concurrent updates in the absence of concurrency control primitives. Used in eventually consistent data stores, it leads to arbitrating between concurrent updates, which causes data loss.

We propose three novel techniques in this thesis. The first is Dotted Version Vectors: a solution that combines a new logical clock mechanism and a request-handling workflow that together support the traditional Dynamo key-value store API while capturing causality in an accurate and scalable way, avoiding false conflicts. It maintains concise information per version, linear only in the number of replicas, and includes a container data structure that allows sets of concurrent versions to be merged efficiently, with time complexity linear in the number of replicas plus versions.
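The key idea of Dotted Version Vectors, in a nutshell (a simplified sketch, not the thesis code): each version is identified by a single "dot", a (replica, counter) pair, plus the causal context it supersedes, so per-version metadata stays linear in the number of replicas even when a client write is concurrent with versions already stored on the server.

```python
# Simplified Dotted Version Vector sketch: a version is a value tagged
# with one dot (replica, counter) plus the causal context it supersedes.

def covered(dot, context):
    replica, counter = dot
    return context.get(replica, 0) >= counter

def put(replica, versions, client_context, value):
    """Coordinate a write at `replica`: discard versions the client has
    already seen, and add the new value with a fresh dot."""
    counter = 1 + max([n for (r, n), _, _ in versions if r == replica],
                      default=client_context.get(replica, 0))
    survivors = [v for v in versions if not covered(v[0], client_context)]
    return survivors + [((replica, counter), dict(client_context), value)]

# Two clients write concurrently from the same (empty) context:
vs = put("r1", [], {}, "A")
vs = put("r1", vs, {}, "B")       # concurrent with "A": both are kept
print([v for _, _, v in vs])      # ['A', 'B']
# Had the second client first read "A" (context {"r1": 1}), "A" would
# have been superseded and only "B" kept: no false conflict.
```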
The second is DottedDB: a Dynamo-like key-value store that uses a novel node-wide logical clock framework, overcoming three fundamental limitations of the state of the art: (1) it minimizes the metadata per key necessary to track causality, avoiding its growth even in the face of node churn; (2) it correctly and durably deletes keys, with no need for tombstones; (3) it offers a lightweight anti-entropy mechanism to converge replicated data, avoiding the need for Merkle trees.
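The node-wide clock idea can be pictured as follows (a loose conceptual sketch, not DottedDB's actual data structures): the node keeps one clock recording every dot it has ever stored, so a key's metadata can shrink to its current dots, and a dot that is absent yet covered by the node clock is known to be deleted rather than never seen, which is what removes the need for tombstones.

```python
# Conceptual sketch of a node-wide logical clock: the node records all
# dots it has seen, so per-key metadata is just the key's live dots.

class NodeClock:
    def __init__(self):
        self.seen = {}   # peer -> highest contiguous counter seen

    def add(self, peer, counter):
        # Simplification: assume dots arrive in order per peer.
        if counter == self.seen.get(peer, 0) + 1:
            self.seen[peer] = counter

    def knows(self, peer, counter):
        return self.seen.get(peer, 0) >= counter

node = NodeClock()
node.add("r1", 1)             # store dot (r1, 1) for some key
# ... key deleted later: its dots are dropped, no tombstone kept ...
print(node.knows("r1", 1))    # True: absence + knowledge => deleted
```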
The third and final contribution is Causal Multi-Value Consistency: a novel consistency model that respects the causality of client operations while properly supporting concurrent updates without arbitration, by having the same Dynamo-like multi-value nature. In addition, we extend this model to provide the same semantics with read and write transactions. For both models, we define an efficient implementation on top of a distributed key-value store.
Funded by Fundação para a Ciência e Tecnologia (FCT) with the research grant SFRH/BD/86735/201
Monotonic Prefix Consistency in Distributed Systems
We study the issue of data consistency in distributed systems. Specifically, we consider a distributed system that replicates its data at multiple sites, which is prone to partitions, and which is assumed to be available (in the sense that queries are always eventually answered). In such a setting, strong consistency, where all replicas of the system apply every operation synchronously, is not possible to implement. However, many weaker consistency criteria that allow a greater number of behaviors than strong consistency are implementable in available distributed systems. We focus on determining the strongest consistency criterion that can be implemented in a convergent and available distributed system that tolerates partitions, considering objects whose set of operations can be split into updates and queries. We show that no criterion stronger than Monotonic Prefix Consistency (MPC) can be implemented.
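Informally, MPC requires each replica to answer queries from a monotonically growing prefix of a single global sequence of updates: replicas may lag behind one another, but none may reorder or roll back what it has applied. A small sketch of a replica obeying this rule (illustrative only; the paper establishes the impossibility bound, not this implementation):

```python
# Illustrative MPC replica: queries are answered from a prefix of one
# global update log, and that prefix only ever grows.

class MpcReplica:
    def __init__(self, log):
        self.log = log        # shared, totally ordered update log
        self.applied = 0      # length of the prefix applied locally
        self.state = []

    def sync(self, up_to):
        # Apply more of the log; never skip or undo (monotonic prefix).
        up_to = max(self.applied, min(up_to, len(self.log)))
        for update in self.log[self.applied:up_to]:
            self.state.append(update)
        self.applied = up_to

    def query(self):
        return list(self.state)

log = ["a", "b", "c"]
r1, r2 = MpcReplica(log), MpcReplica(log)
r1.sync(3); r2.sync(1)
print(r1.query(), r2.query())  # ['a','b','c'] ['a']: prefixes of one order
```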