Improving the Latency and Throughput of ZooKeeper Atomic Broadcast
ZooKeeper is a crash-tolerant system that offers fundamental services to Internet-scale applications, thereby reducing their development and hosting costs. It consists of three or more servers that form a replicated state machine. Maintaining these replicas in a mutually consistent state requires executing an atomic broadcast protocol, Zab, so that concurrent requests for state changes are serialised identically at all replicas before being acted upon.
Thus, ZooKeeper performance for update operations is determined by Zab performance. We contribute two easy-to-implement Zab variants, called ZabAC and ZabAA. They are designed to offer small atomic-broadcast latencies and to reduce the processing load on the primary node that plays the leading role in Zab. The former improves ZooKeeper performance and the latter enables ZooKeeper to face more challenging load conditions.
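The ordering guarantee the abstract describes can be illustrated with a minimal sketch (class and method names here are hypothetical; real Zab tags proposals with an epoch-qualified zxid and waits for quorum acknowledgements before committing): the leader assigns every state-change request a monotonically increasing sequence number, so all replicas apply updates in the same order.

```python
class Leader:
    """Minimal sketch of leader-side ordering in a Zab-like protocol.

    Hypothetical simplification: real Zab uses an (epoch, counter) zxid
    and a quorum-acknowledgement phase omitted here.
    """

    def __init__(self, followers):
        self.followers = followers
        self.counter = 0

    def broadcast(self, update):
        # Serialise concurrent requests by assigning a total order.
        self.counter += 1
        proposal = (self.counter, update)
        for f in self.followers:
            f.deliver(proposal)
        return proposal


class Follower:
    def __init__(self):
        self.log = []

    def deliver(self, proposal):
        # Apply proposals strictly in sequence-number order.
        assert not self.log or proposal[0] == self.log[-1][0] + 1
        self.log.append(proposal)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.broadcast({"set": ("/a", 1)})
leader.broadcast({"set": ("/a", 2)})
# Both replicas hold identical, identically ordered logs.
assert followers[0].log == followers[1].log
```

Because every proposal carries its position in the total order, a replica can detect a gap and refuse to apply out-of-order updates, which is the property that keeps the replicated state machines mutually consistent.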
Mechanisms for improving ZooKeeper Atomic Broadcast performance
PhD Thesis
Coordination services are essential for building higher-level primitives that are often
used in today’s data-center infrastructures, as they greatly facilitate the operation of
distributed client applications. Examples of typical functionalities offered by coordination
services include the provision of group membership, support for leader election,
distributed synchronization, as well as reliable low-volume storage and naming.
To provide reliable services to the client applications, coordination services in general
are replicated for fault tolerance and should deliver high performance to ensure that
they do not become bottlenecks for dependent applications. Apache ZooKeeper, for
example, is a well-known coordination service and applies a primary-backup approach
in which the leader server processes all state-modifying requests and then forwards
the corresponding state updates to a set of follower servers using an atomic broadcast
protocol called Zab.
Having analyzed state-of-the-art coordination services, we identified two main limitations that prevent existing systems such as Apache ZooKeeper from achieving higher write performance. First, while the primary-backup approach prevents the data stored by client applications from being lost as a result of server crashes, it also comes at the cost of a performance penalty: because it relies on a leader-based protocol, performance becomes bottlenecked when the leader server has to handle increased message traffic as the number of client requests and replicas grows. Second, Zab requires significant communication between servers, as it entails three communication steps per broadcast. This can introduce performance overhead and consume more computing resources, leaving fewer guarantees for users, who must then build more complex applications to handle these issues.
To this end, this work makes four contributions. First, we implement ZooKeeper atomic broadcast as a standalone component, extracting it from ZooKeeper so that other developers can build applications on top of Zab without the complexity of integrating the entire ZooKeeper codebase. Second, we propose three variations of Zab, all capable of reaching agreement in fewer communication steps than Zab. These variations are built on the restrictive assumptions that server crashes are independent and that a server quorum remains operative at all times. The first variation offers excellent performance but can only be used in 3-server systems; the other two avoid this limitation. We then redesign the latter two variations to operate under the least restrictive Zab fault assumptions. Third, we design and implement a ZooKeeper coin-tossing protocol, called ZabCT, which addresses the above concerns by having the non-leader server replicas toss a coin and broadcast their acknowledgment of a leader's proposal only if the toss results in Head. We model the ZabCT process and derive analytical expressions for estimating the coin-tossing probability of Head for a given arrival rate of service requests, such that the dual objectives of performance gains and traffic reduction can be accomplished. If ZabCT is judged not to offer performance benefits over Zab, processes should be able to switch autonomously to Zab; we therefore design protocol switching that lets processes move between ZabCT and Zab without stopping message delivery. Finally, an extensive performance evaluation is provided for Zab and the Zab-variant protocols.
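The coin-tossing idea can be sketched in a few lines (hypothetical function and parameter names; the thesis derives the probability of Head analytically from the request arrival rate, which is not reproduced here): each follower acknowledges a proposal only with probability `p_head`, trading acknowledgement traffic against the speed of quorum formation.

```python
import random

def maybe_acknowledge(p_head, send_ack):
    """Sketch of a ZabCT-style probabilistic acknowledgement.

    Hypothetical simplification: p_head is the coin-tossing probability
    of Head; ZabCT chooses it from the service-request arrival rate.
    """
    if random.random() < p_head:  # toss the coin
        send_ack()                # only Head triggers an acknowledgement
        return True
    return False

# With p_head = 0.5, roughly half of the followers' acknowledgements are
# suppressed, cutting leader-bound traffic at the cost of slower quorums.
random.seed(1)
acks = sum(maybe_acknowledge(0.5, lambda: None) for _ in range(1000))
```

Setting `p_head = 1.0` degenerates to plain Zab (every follower always acknowledges), which is why switching between the two protocols without stopping message delivery is a natural extension of the scheme.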
Incremental Consistency Guarantees for Replicated Objects
Programming with replicated objects is difficult. Developers must face the
fundamental trade-off between consistency and performance head on, while
struggling with the complexity of distributed storage stacks. We introduce
Correctables, a novel abstraction that hides most of this complexity, allowing
developers to focus on the task of balancing consistency and performance. To
aid developers with this task, Correctables provide incremental consistency
guarantees, which capture successive refinements on the result of an ongoing
operation on a replicated object. In short, applications receive both a
preliminary---fast, possibly inconsistent---result, as well as a
final---consistent---result that arrives later.
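The preliminary/final pattern can be sketched as a small handle object (a hypothetical, much-simplified API; the real Correctables abstraction is richer): callers register a callback that fires once with a fast, possibly inconsistent value and again with the consistent one.

```python
import threading

class Correctable:
    """Sketch of an incremental-consistency handle.

    Hypothetical simplification of the Correctables idea: one
    preliminary refinement followed by one final, consistent result.
    """

    def __init__(self):
        self._final = threading.Event()
        self._result = None
        self._callbacks = []

    def on_update(self, cb):
        self._callbacks.append(cb)

    def set_preliminary(self, value):
        # Fast, possibly inconsistent view (e.g. from a single replica).
        for cb in self._callbacks:
            cb("preliminary", value)

    def set_final(self, value):
        # Strongly consistent result, arriving later (e.g. from a quorum).
        self._result = value
        for cb in self._callbacks:
            cb("final", value)
        self._final.set()

    def result(self, timeout=None):
        # Block until the final, consistent value is available.
        self._final.wait(timeout)
        return self._result


events = []
c = Correctable()
c.on_update(lambda state, v: events.append((state, v)))
c.set_preliminary(41)   # speculate on this value immediately
c.set_final(42)         # confirm (or correct) once a quorum replies
```

An application can start speculative work as soon as the preliminary value arrives and either commit or roll back when the final value confirms or contradicts it, which is exactly the latency-for-bandwidth trade the evaluation below quantifies.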
We show how to leverage incremental consistency guarantees by speculating on
preliminary values, trading throughput and bandwidth for improved latency. We
experiment with two popular storage systems (Cassandra and ZooKeeper) and three
applications: a Twissandra-based microblogging service, an ad serving system,
and a ticket selling system. Our evaluation on the Amazon EC2 platform with
YCSB workloads A, B, and C shows that we can reduce the latency of strongly
consistent operations by up to 40% (from 100ms to 60ms) at little cost (10%
bandwidth increase, 6% throughput drop) in the ad system. Even if the
preliminary result is frequently inconsistent (25% of accesses), incremental
consistency incurs a bandwidth overhead of only 27%. Comment: 16 total pages, 12 figures. To appear in OSDI'16.
HT-Paxos: High Throughput State-Machine Replication Protocol for Large Clustered Data Centers
Paxos is a prominent protocol for state-machine replication. Recent data-intensive systems that implement state-machine replication generally require high throughput. Earlier versions of Paxos, such as classical Paxos, Fast Paxos, and Generalized Paxos, focus mainly on fault tolerance and latency but fall short in terms of throughput and scalability. A major reason for this is the heavyweight leader; by offloading the leader, the throughput of the system can be increased further. Ring Paxos, Multi-Ring Paxos, and S-Paxos are prominent attempts in this direction for clustered data centers. In this paper, we propose HT-Paxos, a variant of Paxos well suited to large clustered data centers. HT-Paxos offloads the leader significantly further and hence increases the throughput and scalability of the system, while at the same time providing reasonably low latency and response time among high-throughput state-machine replication protocols.
FairLedger: A Fair Blockchain Protocol for Financial Institutions
Financial institutions are currently looking into technologies for
permissioned blockchains. A major effort in this direction is Hyperledger, an
open source project hosted by the Linux Foundation and backed by a consortium
of over a hundred companies. A key component in permissioned blockchain
protocols is a byzantine fault tolerant (BFT) consensus engine that orders
transactions. However, currently available BFT solutions in Hyperledger (as
well as in the literature at large) are inadequate for financial settings; they
are not designed to ensure fairness or to tolerate selfish behavior that arises
when financial institutions strive to maximize their own profit.
We present FairLedger, a permissioned blockchain BFT protocol, which is fair,
designed to deal with rational behavior, and, no less important, easy to
understand and implement. The secret sauce of our protocol is a new
communication abstraction, called detectable all-to-all (DA2A), which allows us
to detect participants (byzantine or rational) that deviate from the protocol,
and punish them. We implement FairLedger in the Hyperledger open source project, using the Iroha framework, one of the biggest projects therein. To evaluate FairLedger's performance, we also implement it in the PBFT framework and compare the two protocols. Our results show that in failure-free scenarios FairLedger achieves better throughput than both Iroha's implementation and PBFT in wide-area settings.
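The detection step behind a DA2A-style exchange can be sketched as follows (hypothetical data layout and function name; real DA2A uses signed messages so that blame can be proven to others, not merely observed locally): after an all-to-all round, each sender that withheld a message or sent conflicting messages is flagged.

```python
def detect_deviators(participants, received):
    """Sketch of the detection step in a detectable all-to-all round.

    Hypothetical simplification: received[i][j] is the message that
    participant i received from participant j (None if none arrived).
    """
    deviators = set()
    for j in participants:
        msgs = {received[i].get(j) for i in participants}
        if None in msgs:      # j withheld its message from someone
            deviators.add(j)
        elif len(msgs) > 1:   # j equivocated: conflicting messages
            deviators.add(j)
    return deviators


parties = {"a", "b", "c"}
received = {
    "a": {"a": "x", "b": "y", "c": "z1"},
    "b": {"a": "x", "b": "y", "c": None},   # "c" withheld from "b"
    "c": {"a": "x", "b": "y", "c": "z2"},   # "c" equivocated
}
# Only "c" is flagged; honest parties "a" and "b" are not.
```

Once a deviator is identified, a protocol built on this abstraction can punish it, e.g. by excluding it from future rounds, which is what removes the incentive for rational participants to cheat.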
Arquitetura de elevada disponibilidade para bases de dados na cloud (High-availability architecture for databases in the cloud)
Master's dissertation in Computer Science
With the constant expansion of computational systems, the amount of data that requires
durability increases exponentially. All persistent data must be replicated in order to provide high availability and fault tolerance, according to the needs of the supported application or use case.
Currently, there are numerous approaches and replication protocols supporting different use cases. There are two prominent families of replication protocols: generic protocols and database-specific ones. The two main techniques associated with generic replication protocols are active and passive replication. Although generic replication techniques are fully matured and widely used, there are inherent problems associated with these protocols, namely: the performance issues of the primary replica in passive replication and the determinism required by active replication. Some of those disadvantages are mitigated by database-specific replication protocols (e.g., using multi-master), but those protocols do not allow a separation between logic and data and cannot be decoupled from the database engine. Moreover, recent strategies consider highly scalable and fault-tolerant distributed logging mechanisms, allowing for newer designs based purely on logs to power replication.
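The log-centred design described above can be sketched as follows (class names are hypothetical; a system like SQLware sits on a distributed, fault-tolerant log service rather than an in-memory list): the log's append order defines the total order, and every replica applies entries in that order, so all replicas converge to the same state.

```python
class SharedLog:
    """Stand-in for a distributed, fault-tolerant log (hypothetical;
    a real deployment would use a replicated log service)."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1   # position defines the total order


class Replica:
    def __init__(self, log):
        self.log = log
        self.state = {}
        self.applied = 0

    def catch_up(self):
        # Tail the log and apply each update in log order.
        while self.applied < len(self.log.entries):
            key, value = self.log.entries[self.applied]
            self.state[key] = value
            self.applied += 1


log = SharedLog()
r1, r2 = Replica(log), Replica(log)
log.append(("balance", 100))
log.append(("balance", 90))
r1.catch_up()
r2.catch_up()
# Both replicas converge to the same state: {"balance": 90}
```

Decoupling replication from the database engine falls out of this structure: the log carries logical updates, and any store that can apply them in order can act as a replica.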
To mitigate the shortcomings found in both active and passive replication mechanisms, as well as in partial variations of these methods, this dissertation presents a hybrid replication middleware, SQLware. The cornerstone of the approach lies in the decoupling between the logical replication layer and the data store, together with the use of a highly scalable distributed log that provides fault tolerance and high availability. We validated the prototype
by conducting a benchmarking campaign to evaluate the overall system’s performance under two distinct infrastructures, namely a private medium-class server and a private high-performance computing cluster. Across the evaluation campaign, we considered the TPC-C benchmark, widely used in the evaluation of online transaction processing (OLTP) database systems. Results show that SQLware achieved 150 times more throughput than the native replication mechanism of the underlying data store considered as baseline, PostgreSQL.
This work was partially funded by FCT - Fundação para a Ciência e a Tecnologia, I.P.,
(Portuguese Foundation for Science and Technology) within project UID/EEA/50014/201
- …