TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services, reaching unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions.
To be recoverable, applications must take additional measures during failure-free execution to maintain a recoverable state of their data and computation logic. However, these precautionary measures have
severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their architectural attributes, which differ significantly from those of traditional memory devices, introduce new semantic challenges for
implementing recovery, complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor's load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism,
lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance.
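The ordering problem RP addresses can be shown with a small simulation (purely illustrative: the function names are ours, and this is not the paper's LRP mechanism). Without persist ordering, a crash may leave a list node's published link in NVM while its payload never persisted, so recovery chases a pointer into garbage; a release-style barrier restricts crash states to prefixes of the write sequence.

```python
import itertools

def crash_states(writes):
    """All NVM states reachable if the cache hierarchy may persist any
    subset of buffered writes, in any order, before a crash."""
    states = []
    for persisted in itertools.chain.from_iterable(
            itertools.combinations(writes, r) for r in range(len(writes) + 1)):
        states.append(dict(persisted))
    return states

def crash_states_ordered(writes):
    """With release-style ordering, only prefixes of the write sequence
    can be persisted at the moment of a crash."""
    return [dict(writes[:i]) for i in range(len(writes) + 1)]

# Insert into a log-free list: write the payload, then publish the pointer.
writes = [("node.value", 42), ("head", "node")]

# Unordered persists: a crash can persist the pointer but not the payload,
# so recovery would follow head into uninitialized memory.
bad = [s for s in crash_states(writes)
       if s.get("head") == "node" and "node.value" not in s]
assert bad

# Ordered persists: any crash state containing the pointer also contains
# the payload, so recovery always sees a consistent structure.
for s in crash_states_ordered(writes):
    assert s.get("head") != "node" or "node.value" in s
```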
We continue our discussion with memory disaggregation, which decouples memory from traditional monolithic servers, offering a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where
compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level
ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVSes. Our experiments demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
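For intuition, one-sided commits in this style are often built from remote compare-and-swap locks plus version validation, since the memory side runs no transaction logic. The sketch below is a hypothetical single-process model of that generic pattern, not Pandora's actual protocol; `Record`, `cas`, and `one_sided_commit` are our names.

```python
class Record:
    """A remote memory word holding a value, a version, and a lock field."""
    def __init__(self, value):
        self.value, self.version, self.lock = value, 0, 0

def cas(rec, attr, expect, new):
    """Models a one-sided RDMA compare-and-swap on remote memory."""
    if getattr(rec, attr) == expect:
        setattr(rec, attr, new)
        return True
    return False

def one_sided_commit(txn_id, writes, read_versions):
    """Lock each written record with CAS, validate that read versions are
    unchanged, then install writes and bump versions. Locks are always
    released, so an aborted transaction leaves no residue."""
    locked = []
    try:
        for rec, _ in writes:
            if not cas(rec, "lock", 0, txn_id):
                return False          # lost a lock race: abort
        locked = [rec for rec, _ in writes]
        for rec, ver in read_versions:
            if rec.version != ver:
                return False          # read set changed: abort
        for rec, val in writes:
            rec.value, rec.version = val, rec.version + 1
        return True
    finally:
        for rec in locked:
            rec.lock = 0

r = Record("old")
assert one_sided_commit(7, [(r, "new")], [(r, 0)])
assert r.value == "new" and r.version == 1 and r.lock == 0
```

The ordering hazard the abstract alludes to lives in steps like these: lock acquisition, validation, and write-back are separate one-sided operations, so the architect, not an RPC handler, must get their order right.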
Finally, we introduce a novel targeted litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby requiring no intervention from programmers.
ENHANCING CLOUD SYSTEM RUNTIME TO ADDRESS COMPLEX FAILURES
As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them.
This dissertation aims to enhance distributed systems' capabilities to detect, localize, and react to complex failures at runtime. To this end, it makes contributions addressing three emerging categories of failures in cloud systems. The first part investigates partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with the silent semantic failures prevalent in cloud systems, showcasing our study findings and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems.
Distributed System Fuzzing
Grey-box fuzzing is the lightweight approach of choice for finding bugs in
sequential programs. It provides a balance between efficiency and effectiveness
by conducting a biased random search over the domain of program inputs using a
feedback function from observed test executions. For distributed system
testing, however, the state-of-practice is represented today by only black-box
tools that do not attempt to infer and exploit any knowledge of the system's
past behaviours to guide the search for bugs.
In this work, we present Mallory: the first framework for grey-box
fuzz-testing of distributed systems. Unlike popular black-box distributed
system fuzzers, such as Jepsen, that search for bugs by randomly injecting
network partitions and node faults or by following human-defined schedules,
Mallory is adaptive. It exercises a novel metric to learn how to maximize the
number of observed system behaviors by choosing different sequences of faults,
thus increasing the likelihood of finding new bugs. The key enablers for our
approach are the new ideas of timeline-driven testing and timeline abstraction
that provide the feedback function guiding a biased random search for failures.
Mallory dynamically constructs Lamport timelines of the system behaviour,
abstracts these timelines into happens-before summaries, and introduces faults
guided by its real-time observation of the summaries.
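The timeline machinery can be sketched in miniature (an illustrative reconstruction; the function names are ours, and Mallory's real abstraction is richer): assign Lamport clocks to send/receive events observed across nodes, then summarize the timeline as its set of cross-node happens-before edges, which serves as the coverage signal.

```python
from collections import defaultdict

def lamport_timeline(events):
    """Assign Lamport clocks to (node, kind, msg_id) events, where kind is
    'local', 'send', or 'recv' and messages are matched by msg_id."""
    clock = defaultdict(int)   # per-node logical clock
    msg_clock = {}             # clock value carried by each message
    timeline = []
    for node, kind, mid in events:
        if kind == "recv":
            clock[node] = max(clock[node], msg_clock[mid])
        clock[node] += 1
        if kind == "send":
            msg_clock[mid] = clock[node]
        timeline.append((node, kind, mid, clock[node]))
    return timeline

def hb_summary(timeline):
    """Abstract a timeline into its cross-node happens-before edges:
    (sender, receiver) for every delivered message."""
    sends = {mid: node for node, kind, mid, _ in timeline if kind == "send"}
    return {(sends[mid], node) for node, kind, mid, _ in timeline
            if kind == "recv"}

events = [("A", "send", "m1"), ("B", "recv", "m1"),
          ("B", "send", "m2"), ("C", "recv", "m2")]
tl = lamport_timeline(events)
print(sorted(hb_summary(tl)))  # [('A', 'B'), ('B', 'C')]
```

A fuzzer in this style would count distinct summaries produced by each fault sequence and bias its search toward faults that yield summaries not seen before.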
We have evaluated Mallory on a diverse set of widely-used industrial
distributed systems. Compared to the state-of-the-art black-box fuzzer Jepsen,
Mallory explores more behaviours and takes less time to find bugs. Mallory
discovered 22 zero-day bugs (of which 18 were confirmed by developers),
including 10 new vulnerabilities, in rigorously-tested distributed systems such
as Braft, Dqlite, and Redis. Six new CVEs have been assigned.
NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall
Strictly serializable datastores greatly simplify the development of correct
applications by providing strong consistency guarantees. However, existing
techniques pay unnecessary costs for naturally consistent transactions, which
arrive at servers in an order that is already strictly serializable. We find
these transactions are prevalent in datacenter workloads. We exploit this
natural arrival order by executing transaction requests with minimal costs
while optimistically assuming they are naturally consistent, and then leverage
a timestamp-based technique to efficiently verify if the execution is indeed
consistent. In the process of designing such a timestamp-based technique, we
identify a fundamental pitfall in relying on timestamps to provide strict
serializability, and name it the timestamp-inversion pitfall. We find
timestamp-inversion has affected several existing works.
We present Natural Concurrency Control (NCC), a new concurrency control
technique that guarantees strict serializability and ensures minimal costs --
i.e., one-round latency, lock-free, and non-blocking execution -- in the best
(and common) case by leveraging natural consistency. NCC is enabled by three
key components: non-blocking execution, decoupled response control, and
timestamp-based consistency check. NCC avoids timestamp-inversion with a new
technique: response timing control, and proposes two optimization techniques,
asynchrony-aware timestamps and smart retry, to reduce false aborts. Moreover,
NCC designs a specialized protocol for read-only transactions, which is the
first to achieve the optimal best-case performance while ensuring strict
serializability, without relying on synchronized clocks. Our evaluation shows
that NCC outperforms state-of-the-art solutions by an order of magnitude on
many workloads.
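A toy version of such a timestamp-based consistency check (our simplification, not NCC's actual algorithm): execute requests optimistically in arrival order, then flag any access whose transaction timestamp runs backwards on a key, since that contradicts the serialization order the timestamps are supposed to witness.

```python
def verify_natural_order(log):
    """log: (txn_timestamp, key) accesses in the order they executed.
    The optimistic execution is consistent iff, per key, transaction
    timestamps are non-decreasing; a decrease means the timestamp order
    contradicts the execution order on that key, so the offending
    transaction must abort (or retry, in a real protocol)."""
    last = {}      # highest timestamp seen per key
    aborts = []
    for ts, key in log:
        if last.get(key, 0) > ts:
            aborts.append(ts)
        else:
            last[key] = ts
    return aborts

# Naturally consistent arrival: everything commits in one pass.
assert verify_natural_order([(1, "x"), (2, "x"), (3, "y")]) == []

# Inverted arrival on key x: txn 2 executed after txn 5 despite its
# smaller timestamp, so it cannot be serialized in timestamp order.
assert verify_natural_order([(5, "x"), (2, "x")]) == [2]
```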
Securing IoT Applications through Decentralised and Distributed IoT-Blockchain Architectures
The integration of blockchain into IoT can provide reliable control of the IoT network's
ability to distribute computation over a large number of devices. It also allows the AI
system to use trusted data for analysis and forecasts while utilising the available IoT
hardware to coordinate the execution of tasks in parallel, using a fully distributed
approach.
This thesis's first contribution is a practical implementation of a real-world IoT-blockchain application, a flood detection use case, demonstrated using Ethereum proof of authority (PoA). This includes performance measurements of the transaction confirmation time, the system end-to-end latency, and the average power consumption. The study showed that blockchain can be integrated into IoT applications, and that Ethereum PoA can be used within IoT for permissioned implementations. This can be achieved while the average energy consumption of running the flood detection system, including the Ethereum Geth client, is small (around 0.3 J).
The second contribution is a novel IoT-centric consensus protocol called honesty-based distributed proof of authority (HDPoA) via scalable work. HDPoA was analysed and then deployed and tested. Performance measurements and evaluation, along with security analyses of HDPoA, were conducted using a total of 30 different IoT devices comprising Raspberry Pi, ESP32, and ESP8266 devices. These measurements included energy consumption, the devices' hash power, and the transaction confirmation time. The measured values of hashes per joule (h/J) for mining were 13.8 Kh/J, 54 Kh/J, and 22.4 Kh/J when using the Raspberry Pi, ESP32, and ESP8266 devices, respectively; this was achieved with limited impact on each device's power. In HDPoA, the transaction confirmation time was reduced to only one block, compared to up to six blocks in Bitcoin.
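The hash-per-joule figures above are simply hash rate divided by power draw. The sketch below shows the arithmetic with hypothetical inputs chosen to reproduce the Raspberry Pi ratio; the thesis's underlying rates and wattages are not given here, so treat the concrete numbers as assumptions.

```python
def hash_per_joule(hash_rate_hps, power_watts):
    """Mining energy efficiency: hashes per joule = (hashes/s) / (J/s)."""
    return hash_rate_hps / power_watts

# A device hashing at 69 Kh/s while drawing 5 W yields 13.8 Kh/J,
# matching the Raspberry Pi figure reported above (inputs hypothetical).
assert hash_per_joule(69_000, 5.0) == 13_800
```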
The third contribution is a novel, secure, distributed and decentralised architecture for supporting the implementation of distributed artificial intelligence (DAI) using hardware platforms provided by the IoT. A trained DAI system was implemented over the IoT, where each IoT device hosts one or more neurons within the DAI layers. This is accomplished through the utilisation of blockchain technology, which allows trusted interaction and information exchange between distributed neurons. Three different datasets were tested, and the system achieved a similar accuracy to testing on a standalone system; both achieved accuracies of 92%-98%. The system accomplished this while ensuring an overall latency as low as two minutes. This showed the secure architecture's capability to facilitate the implementation of DAI within the IoT while preserving the accuracy of the system.
The fourth contribution is a novel and secure architecture that integrates the advantages offered by edge computing, artificial intelligence (AI), IoT end-devices, and blockchain. This new architecture can monitor the environment, collect data, analyse it, process it using an AI-expert engine, provide predictions and actionable outcomes, and finally share these on a public blockchain platform. The pandemic caused by the wide and rapid spread of the novel coronavirus COVID-19 was used as a use-case implementation to test and evaluate the proposed system. While providing the AI engine with trusted data, the system achieved an accuracy of 95%. This is achieved while the AI engine requires only a 7% increase in power consumption. This demonstrates the system's ability to protect the data and support the AI system, and it improves overall IoT security with limited impact on the IoT devices.
The fifth and final contribution is enhancing the security of HDPoA through the integration of a hardware secure module (HSM) and a hardware wallet (HW). A performance evaluation of the energy consumption of nodes equipped with an HSM and HW, together with a security analysis, was conducted. In addition to enhancing the nodes' security, the HSM can be used to sign more than 120 bytes/joule and encrypt up to 100 bytes/joule, while the HW can be used to sign up to 90 bytes/joule and encrypt up to 80 bytes/joule. The results and analyses demonstrated that the HSM and HW enhance the security of HDPoA and can also be utilised within IoT-blockchain applications, providing much-needed security in terms of confidentiality, trust in devices, and attack deterrence.
The above contributions showed that blockchain can be integrated into IoT systems. They showed that blockchain can successfully support the integration of other technologies such as AI, IoT end devices, and edge computing into one system, thus allowing organisations and users to benefit greatly from resilient, distributed, decentralised, self-managed, robust, and secure systems.
Objcache: An Elastic Filesystem over External Persistent Storage for Container Clusters
Container virtualization enables emerging AI workloads such as model serving,
highly parallelized training, machine learning pipelines, and so on, to be
easily scaled on demand on the elastic cloud infrastructure. Particularly, AI
workloads require persistent storage to store data such as training inputs,
models, and checkpoints. An external storage system like cloud object storage
is a common choice because of its elasticity and scalability. To mitigate
access latency to external storage, caching at a local filesystem is an
essential technique. However, building local caches on scaling clusters must
cope with explosive disk usage, redundant networking, and unexpected failures.
We propose objcache, an elastic filesystem over external storage. Objcache
introduces an internal transaction protocol over Raft logging to enable atomic
updates of distributed persistent states with consistent hashing. The proposed
transaction protocol can also manage inode dirtiness by maintaining the
consistency between the local cache and external storage. Objcache supports
scaling down to zero by automatically evicting dirty files to external storage.
Our evaluation reports that objcache sped up model serving startup by 98.9%
compared to direct copies via S3 interfaces. Scaling up with 1024 dirty files
completed in 2 to 14 seconds.
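Consistent hashing, which objcache uses to place distributed persistent state, can be sketched as follows (a generic textbook ring, not objcache's actual implementation; the class and node names are ours): keys and virtual nodes hash onto a ring, each key belongs to the first node clockwise from its position, and membership changes only remap the affected arc.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes. Adding or removing
    a node only remaps the keys that fell in that node's arcs."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

    @staticmethod
    def _h(s):
        # First 8 bytes of SHA-256 as the ring position.
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def lookup(self, key):
        # First virtual node clockwise from the key's hash (wraps around).
        i = bisect.bisect(self.ring, (self._h(key),)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("models/checkpoint-17")
print(owner)  # one of node-a / node-b / node-c, stable across runs
```

The stability property is what matters for a cache: after removing `node-c`, every key previously owned by `node-a` or `node-b` keeps its owner, so only one node's cached state needs re-fetching from external storage.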
Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write
A distributed multi-writer multi-reader (MWMR) atomic register is an
important primitive that enables a wide range of distributed algorithms. Hence,
improving its performance can have large-scale consequences. Since the seminal
work of ABD emulation in the message-passing networks [JACM '95], many
researchers study fast implementations of atomic registers under various
conditions. "Fast" means that a read or a write can be completed with 1
round-trip time (RTT), by contacting a simple majority. In this work, we
explore an atomic register with optimal resilience and "optimistically fast"
read and write operations. That is, both operations can be fast if there is no
concurrent write.
This paper has three contributions: (i) We present Gus, the emulation of an
MWMR atomic register with optimal resilience and optimistically fast reads and
writes when there are up to 5 nodes; (ii) We show that when there are > 5
nodes, it is impossible to emulate an MWMR atomic register with both
properties; and (iii) We implement Gus in the framework of EPaxos and Gryff,
and show that Gus provides lower tail latency than state-of-the-art systems
such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context
of geo-replicated object storage systems.
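The "optimistically fast" idea can be illustrated with an ABD-style quorum read (a deliberately simplified model; Gus's protocol, quorum sizes, and write path differ): if a majority returns identical timestamped values, the read completes in one RTT, otherwise a write-back phase is required before returning.

```python
def quorum_read(replies):
    """replies: (timestamp, value) pairs from a majority of replicas.
    If all replies agree, the read is 'fast' (1 RTT); otherwise the reader
    must write back the freshest pair before returning (extra RTT), as in
    ABD-style register emulations. Returns (value, was_fast)."""
    freshest = max(replies)                      # highest (timestamp, value)
    fast = all(r == freshest for r in replies)
    return freshest[1], fast

# No concurrent write: every replica agrees, so the fast path applies.
assert quorum_read([(3, "v3"), (3, "v3"), (3, "v3")]) == ("v3", True)

# A write is in flight: replicas disagree, forcing the slow path.
assert quorum_read([(3, "v3"), (2, "v2"), (3, "v3")]) == ("v3", False)
```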
Confidential Consortium Framework: Secure Multiparty Applications with Confidentiality, Integrity, and High Availability
Confidentiality, integrity protection, and high availability, abbreviated to
CIA, are essential properties for trustworthy data systems. The rise of cloud
computing and the growing demand for multiparty applications however means that
building modern CIA systems is more challenging than ever. In response, we
present the Confidential Consortium Framework (CCF), a general-purpose
foundation for developing secure stateful CIA applications. CCF combines
centralized compute with decentralized trust, supporting deployment on
untrusted cloud infrastructure and transparent governance by mutually untrusted
parties. CCF leverages hardware-based trusted execution environments for
remotely verifiable confidentiality and code integrity. This is coupled with
state machine replication backed by an auditable immutable ledger for data
integrity and high availability. CCF enables each service to bring its own
application logic, custom multiparty governance model, and deployment scenario,
decoupling the operators of nodes from the consortium that governs them. CCF is
open-source and available now at https://github.com/microsoft/CCF. (16 pages, 9 figures. To appear in the Proceedings of the VLDB Endowment, Volume 1.)
Leveraging TLA+ Specifications to Improve the Reliability of the ZooKeeper Coordination Service
ZooKeeper is a coordination service, widely used as a backbone of various
distributed systems. Though its reliability is of critical importance, testing
is insufficient for an industrial-strength system of the size and complexity of
ZooKeeper, and deep bugs can still be found. To this end, we resort to formal
TLA+ specifications to further improve the reliability of ZooKeeper. Our
primary objective is usability and automation, rather than full verification.
We incrementally develop three levels of specifications for ZooKeeper. We first
obtain the protocol specification, which unambiguously specifies the Zab
protocol behind ZooKeeper. We then proceed to a finer grain and obtain the
system specification, which serves as the super-doc for system development. In
order to further leverage the model-level specification to improve the
reliability of the code-level implementation, we develop the test
specification, which guides the explorative testing of the ZooKeeper
implementation. The formal specifications help eliminate the ambiguities in the
protocol design and provide comprehensive system documentation. They also help
find critical deep bugs in system implementation, which are beyond the reach of
state-of-the-art testing techniques. Our specifications have been merged into
the official Apache ZooKeeper project.
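The model-level checking that such specifications enable amounts to a breadth-first search over reachable states against an invariant. Below is a tiny explicit-state checker in the spirit of TLC (our toy; the state space and invariant are hypothetical and unrelated to the actual Zab/ZooKeeper specifications):

```python
from collections import deque

def model_check(init, next_states, invariant):
    """Breadth-first exploration of reachable states, returning the first
    state that violates the invariant, or None if none is reachable."""
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            return s
        for t in next_states(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return None

# Toy spec: two counters, each incrementable modulo 4; the (hypothetical)
# invariant claims they are never both odd.
def nxt(s):
    a, b = s
    return [((a + 1) % 4, b), (a, (b + 1) % 4)]

violation = model_check((0, 0), nxt, lambda s: not (s[0] % 2 and s[1] % 2))
print(violation)  # → (1, 1)
```

A real TLC run differs in scale and features (symmetry, fairness, liveness), but the core loop, enumerating states and checking invariants, is the same shape.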