TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions. To be recoverable, applications must take additional measures to maintain a recoverable state of data and computation logic during failure-free execution. However, these precautionary measures have severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for
implementing recovery, complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor's load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data rarely persists in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model, Release Persistency (RP), which provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism, lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance.
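The ordering problem motivating persistency models can be illustrated with a toy model (a sketch only; the ToyNVM class and persist_barrier primitive are hypothetical stand-ins, not the RP/LRP interface from this thesis):

```python
# Toy model of ordered persistence to NVM. Writes land in a volatile
# cache and only become durable after an explicit barrier; a crash
# discards anything still in the cache. All names are illustrative.

class ToyNVM:
    def __init__(self):
        self.cache = {}   # writes not yet guaranteed durable
        self.nvm = {}     # what survives a crash

    def store(self, addr, value):
        self.cache[addr] = value      # may persist in any order

    def persist_barrier(self):
        self.nvm.update(self.cache)   # force buffered writes durable
        self.cache.clear()

    def crash(self):
        self.cache.clear()            # unpersisted writes are lost
        return dict(self.nvm)

# Publishing a node into a persistent structure: persist the payload
# *before* the pointer that makes it reachable, mirroring release
# semantics. Recovery then never sees a pointer to missing data.
mem = ToyNVM()
mem.store("node.value", 42)
mem.persist_barrier()        # payload durable first
mem.store("head", "node")    # pointer published after
state = mem.crash()
```

After the crash, the head pointer may be lost (the write was still buffered), but if it had survived, the payload it points to would be guaranteed durable too, which is exactly the invariant recovery relies on.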
We continue our discussion with memory disaggregation, which decouples memory from traditional monolithic servers, offering a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where
compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level
ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVSes. Our experimental evaluation demonstrates that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from the programmers.
Auditable and performant Byzantine consensus for permissioned ledgers
Permissioned ledgers allow users to execute transactions against a data store, and retain proof of their execution in a replicated ledger. Each replica verifies the transactions' execution and ensures that, in perpetuity, a committed transaction cannot be removed from the ledger. Unfortunately, this is not guaranteed by today's permissioned ledgers, which can be re-written if an arbitrary number of replicas collude. In addition, the transaction throughput of permissioned ledgers is low, hampering real-world deployments, because they do not take advantage of multi-core CPUs and hardware accelerators.
This thesis explores how permissioned ledgers and their consensus protocols can be made auditable in perpetuity, even when all replicas collude and re-write the ledger. It also addresses how Byzantine consensus protocols can be changed to increase the execution throughput of complex transactions. This thesis makes the following contributions:
1. Always auditable Byzantine consensus protocols. We present a permissioned ledger system that can assign blame to individual replicas regardless of how many of them misbehave. This is achieved by signing and storing consensus protocol messages in the ledger and providing clients with signed, universally-verifiable receipts.
2. Performant transaction execution with hardware accelerators. Next, we describe a cloud-based ML inference service that provides strong integrity guarantees, while staying compatible with current inference APIs. We change the Byzantine consensus protocol to execute machine learning (ML) inference computation on GPUs, optimizing the throughput and latency of inference.
3. Parallel transaction execution on multi-core CPUs. Finally, we introduce a permissioned ledger that executes transactions in parallel on multi-core CPUs. We separate the execution of transactions between the primary and secondary replicas. The primary replica executes transactions on multiple CPU cores and creates a dependency graph of the transactions that the backup replicas utilize to execute transactions in parallel.
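The dependency graph in contribution 3 can be sketched as follows (a minimal illustration under assumed conflict rules; the function and transaction names are hypothetical, not the system's actual implementation). The primary derives edges from read/write sets; backups may replay in any order respecting those edges:

```python
# Build a conflict dependency graph from transactions' read/write
# sets. An edge i -> j means transaction j must wait for an earlier
# transaction i because they conflict (w/w, w/r, or r/w).

def build_deps(txns):
    """txns: list of (tid, read_set, write_set) in commit order."""
    deps = {tid: set() for tid, _, _ in txns}
    for i, (ti, ri, wi) in enumerate(txns):
        for tj, rj, wj in txns[i + 1:]:
            if wi & wj or wi & rj or ri & wj:
                deps[tj].add(ti)
    return deps

txns = [
    ("t1", {"x"}, {"x"}),        # updates x
    ("t2", {"y"}, {"y"}),        # updates y; independent of t1
    ("t3", {"x", "y"}, {"z"}),   # reads both, so must follow t1 and t2
]
deps = build_deps(txns)
# t1 and t2 have no edges between them and can replay in parallel;
# t3 waits for both.
```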
Secure storage systems for untrusted cloud environments
The cloud has become established for applications that need to be scalable and highly
available. However, moving data to data centers owned and operated by a third party,
i.e., the cloud provider, raises security concerns because a cloud provider could easily
access and manipulate the data or program flow, preventing the cloud from being
used for certain applications, such as medical or financial ones.
Hardware vendors are addressing these concerns by developing Trusted Execution
Environments (TEEs) that make the CPU state and parts of memory inaccessible from
the host software. While TEEs protect the current execution state, they do not provide
security guarantees for data that does not fit or reside in the protected memory
area, such as network and persistent storage.
In this work, we aim to address TEEs' limitations in three ways: first, we
extend the trust of TEEs to persistent storage; second, we extend the trust to multiple
nodes in a network; and third, we propose a compiler-based solution for accessing
heterogeneous memory regions. More specifically:
• SPEICHER extends the trust provided by TEEs to persistent storage. SPEICHER
implements a key-value interface. Its design is based on LSM data structures, but
extends them to provide confidentiality, integrity, and freshness for the stored
data. Thus, SPEICHER can prove to the client that the data has not been tampered
with by an attacker.
• AVOCADO is a distributed in-memory key-value store (KVS) that extends the
trust that TEEs provide across the network to multiple nodes, allowing KVSs to
scale beyond the boundaries of a single node. On each node, AVOCADO carefully
divides data between trusted memory and untrusted host memory to maximize
the amount of data that can be stored on each node. AVOCADO leverages the
fact that network attacks can be modeled as crash faults to trust other nodes with
a hardened ABD replication protocol.
• TOAST is based on the observation that modern high-performance systems
often use several different heterogeneous memory regions that are not easily
distinguishable by the programmer. The number of regions is increased by the
fact that TEEs divide memory into trusted and untrusted regions. TOAST is a
compiler-based approach that unifies access to different heterogeneous memory
regions, providing programmability and portability. TOAST uses a
load/store interface to abstract most library interfaces for the different memory
regions.
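The ABD-style majority replication that AVOCADO builds on can be sketched as follows (an assumed simplification in plain Python; the two-phase structure is the classic ABD protocol, while the replica handling and names are illustrative):

```python
# Sketch of ABD-style majority (quorum) replication: a write first
# reads timestamps from a majority, picks a higher one, then writes
# to a majority; a read returns the highest-timestamped value it
# sees and writes it back so later reads cannot observe older data.

class Replica:
    def __init__(self):
        self.ts, self.value = 0, None

def write_back(replicas, ts, value):
    for r in replicas:
        if ts > r.ts:
            r.ts, r.value = ts, value

def abd_write(replicas, value):
    quorum = len(replicas) // 2 + 1
    responders = replicas[:quorum]          # a responding majority
    ts = max(r.ts for r in responders) + 1  # phase 1: pick fresh ts
    write_back(responders, ts, value)       # phase 2: store at majority

def abd_read(replicas):
    quorum = len(replicas) // 2 + 1
    responders = replicas[:quorum]
    latest = max(responders, key=lambda r: r.ts)
    write_back(responders, latest.ts, latest.value)  # read write-back
    return latest.value

replicas = [Replica() for _ in range(3)]
abd_write(replicas, "v1")
```

Because any two majorities intersect, a read quorum always contains at least one replica that saw the latest write, which is what lets AVOCADO treat (hardened) network faults like crash faults.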
Efficient Black-box Checking of Snapshot Isolation in Databases
Snapshot isolation (SI) is a prevalent weak isolation level that avoids the
performance penalty imposed by serializability and simultaneously prevents
various undesired data anomalies. Nevertheless, SI anomalies have recently been
found in production cloud databases that claim to provide the SI guarantee.
Given the complex and often unavailable internals of such databases, a
black-box SI checker is highly desirable.
In this paper we present PolySI, a novel black-box checker that efficiently
checks SI and provides understandable counterexamples upon detecting
violations. PolySI builds on a novel characterization of SI using generalized
polygraphs (GPs), whose soundness and completeness we establish. PolySI
employs an SMT solver and also accelerates SMT solving by utilizing the compact
constraint encoding of GPs and domain-specific optimizations for pruning
constraints. As demonstrated by our extensive evaluation, PolySI successfully
reproduces all 2477 known SI anomalies, detects novel SI violations in three
production cloud databases, identifies their causes, outperforms the
state-of-the-art black-box checkers under a wide range of workloads, and can
scale up to large-sized workloads. (20 pages, 15 figures; accepted by PVLDB.)
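The core idea behind graph-based isolation checkers can be shown in miniature (a drastic simplification of PolySI's generalized polygraphs; the edges here are given directly rather than inferred from histories): an SI violation manifests as a cycle among transaction dependencies.

```python
# DFS-based cycle detection over a transaction dependency graph.
# A cycle among (anti-)dependency edges witnesses an isolation
# anomaly; an acyclic graph admits a valid ordering.

def has_cycle(edges, nodes):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GRAY                      # on the current path
        for m in edges.get(n, ()):
            if color[m] == GRAY:             # back edge: cycle found
                return True
            if color[m] == WHITE and dfs(m):
                return True
        color[n] = BLACK                     # fully explored
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)

# Classic write skew: T1 reads y then writes x; T2 reads x then
# writes y. Each read misses the other's write, so there is a
# read-write antidependency in both directions: a cycle.
write_skew = {"T1": ["T2"], "T2": ["T1"]}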
Merging Queries in OLTP Workloads
OLTP applications are usually executed by a high number of clients in parallel and typically face high throughput demands as well as strict latency requirements for individual statements. In enterprise scenarios, they often face the challenge of dealing with overload spikes resulting from events such as Cyber Monday or Black Friday. The traditional solution to prevent running out of resources, and thus to cope with such spikes, is significant over-provisioning of the underlying infrastructure. In this thesis, we analyze real enterprise OLTP workloads with respect to statement types, complexity, and hot-spot statements. Interestingly, our findings reveal that workloads are often read-heavy and comprise similar query patterns, which offers the potential to share work among statements belonging to different transactions. In the past, resource sharing has been extensively studied for OLAP workloads. Naturally, the question arises: why do studies mainly focus on OLAP rather than OLTP workloads?
At first sight, OLTP queries often consist of simple calculations, such as index look-ups, with little sharing potential. Consequently, such queries, due to their short execution time, may not offer enough benefit to amortize the additional overhead. In addition, OLTP workloads do not only execute read operations but also updates. Therefore, sharing work needs to obey transactional semantics, such as the given isolation level and read-your-own-writes.
This thesis presents THE LEVIATHAN, a novel batching scheme for OLTP workloads: an approach for merging read statements within interactively submitted multi-statement transactions consisting of reads and updates. Our main idea is to merge the execution of statements by merging their plans, which makes it possible to merge not only complex operations but also simple calculations, such as the aforementioned index look-ups. We identify mergeable statements by pattern matching on prepared statement plans, which comes with low overhead. To obey the isolation-level properties and provide read-your-own-writes, we define a formal framework for merging transactions running under a given isolation level and provide insights into a prototypical implementation of merging within a commercial database system.
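The plan-matching step can be sketched as follows (a hedged illustration of the batching idea; the function name and plan signatures are hypothetical, not LEVIATHAN's actual interface):

```python
# Group statements from different transactions by their prepared-
# statement plan signature. Statements sharing a plan can be merged
# into one execution (e.g., a shared scan with a combined predicate).

from collections import defaultdict

def merge_batch(statements):
    """statements: list of (txn_id, plan_signature, params)."""
    groups = defaultdict(list)
    for txn, plan, params in statements:
        groups[plan].append((txn, params))
    return groups

batch = [
    ("t1", "SELECT balance WHERE id=?", (7,)),
    ("t2", "SELECT balance WHERE id=?", (9,)),   # same plan as t1
    ("t3", "SELECT name WHERE id=?", (7,)),      # different plan
]
merged = merge_batch(batch)
# t1 and t2 share a plan: the merged execution runs once with a
# combined predicate (e.g., id IN (7, 9)) and results are routed
# back to each transaction.
```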
Our experimental evaluation shows that, depending on the isolation level, the load in the system, and the read share of the workload, an improvement of the transaction throughput by up to a factor of 2.5x is possible without compromising the transactional semantics. Another interesting effect we show is that with our strategy, we can increase the throughput of a real enterprise workload by 20%.

1 INTRODUCTION
1.1 Summary of Contributions
1.2 Outline
2 WORKLOAD ANALYSIS
2.1 Analyzing OLTP Benchmarks
2.1.1 YCSB
2.1.2 TATP
2.1.3 TPC Benchmark Scenarios
2.1.4 Summary
2.2 Analyzing OLTP Workloads from Open Source Projects
2.2.1 Characteristics of Workloads
2.2.2 Summary
2.3 Analyzing Enterprise OLTP Workloads
2.3.1 Overview of Reports about OLTP Workload Characteristics
2.3.2 Analysis of SAP Hybris Workload
2.3.3 Summary
2.4 Conclusion
3 RELATED WORK ON QUERY MERGING
3.1 Merging the Execution of Operators
3.2 Merging the Execution of Subplans
3.3 Merging the Results of Subplans
3.4 Merging the Execution of Full Plans
3.5 Miscellaneous Works on Merging
3.6 Discussion
4 MERGING STATEMENTS IN MULTI STATEMENT TRANSACTIONS
4.1 Overview of Our Approach
4.1.1 Examples
4.1.2 Why Naïve Merging Fails
4.2 THE LEVIATHAN Approach
4.3 Formalizing THE LEVIATHAN Approach
4.3.1 Transaction Theory
4.3.2 Merging Under MVCC
4.4 Merging Reads Under Different Isolation Levels
4.4.1 Read Uncommitted
4.4.2 Read Committed
4.4.3 Repeatable Read
4.4.4 Snapshot Isolation
4.4.5 Serializable
4.4.6 Discussion
4.5 Merging Writes Under Different Isolation Levels
4.5.1 Read Uncommitted
4.5.2 Read Committed
4.5.3 Snapshot Isolation
4.5.4 Serializable
4.5.5 Handling Dependencies
4.5.6 Discussion
5 SYSTEM MODEL
5.1 Definition of the Term "Overload"
5.2 Basic Queuing Model
5.2.1 Option (1): Replacement with a Merger Thread
5.2.2 Option (2): Adding Merger Thread
5.2.3 Using Multiple Merger Threads
5.2.4 Evaluation
5.3 Extended Queue Model
5.3.1 Option (1): Replacement with a Merger Thread
5.3.2 Option (2): Adding Merger Thread
5.3.3 Evaluation
6 IMPLEMENTATION
6.1 Background: SAP HANA
6.2 System Design
6.2.1 Read Committed
6.2.2 Snapshot Isolation
6.3 Merger Component
6.3.1 Overview
6.3.2 Dequeuing
6.3.3 Merging
6.3.4 Sending
6.3.5 Updating MTx State
6.4 Challenges in the Implementation of Merging Writes
6.4.1 SQL String Implementation
6.4.2 Update Count
6.4.3 Error Propagation
6.4.4 Abort and Rollback
7 EVALUATION
7.1 Benchmark Settings
7.2 System Settings
7.2.1 Experiment I: End-to-end Response Time Within a SAP Hybris System
7.2.2 Experiment II: Dequeuing Strategy
7.2.3 Experiment III: Merging Improvement on Different Statement, Transaction and Workload Types
7.2.4 Experiment IV: End-to-End Latency in YCSB
7.2.5 Experiment V: Breakdown of Execution in YCSB
7.2.6 Discussion of System Settings
7.3 Merging in Interactive Transactions
7.3.1 Experiment VI: Merging TATP in Read Uncommitted
7.3.2 Experiment VII: Merging TATP in Read Committed
7.3.3 Experiment VIII: Merging TATP in Snapshot Isolation
7.4 Merging Queries in Stored Procedures
7.4.1 Experiment IX: Merging TATP Stored Procedures in Read Committed
7.5 Merging SAP Hybris
7.5.1 Experiment X: CPU-time Breakdown on HANA Components
7.5.2 Experiment XI: Merging Media Query in SAP Hybris
7.5.3 Discussion of our Results in Comparison with Related Work
8 CONCLUSION
8.1 Summary
8.2 Future Research Directions
REFERENCES
A UML CLASS DIAGRAM
The FIDS Theorems: Tensions between Multinode and Multicore Performance in Transactional Systems
Traditionally, distributed and parallel transactional systems have been
studied in isolation, as they targeted different applications and experienced
different bottlenecks. However, modern high-bandwidth networks have made the
study of systems that are both distributed (i.e., employ multiple nodes) and
parallel (i.e., employ multiple cores per node) necessary to truly make use of
the available hardware.
In this paper, we study the performance of these combined systems and show
that there are inherent tradeoffs between a system's ability to have fast and
robust distributed communication and its ability to scale to multiple cores.
More precisely, we formalize the notions of a \emph{fast deciding} path of
communication to commit transactions quickly in good executions, and
\emph{seamless fault tolerance} that allows systems to remain robust to server
failures. We then show that there is an inherent tension between these two
natural distributed properties and well-known multicore scalability properties
in transactional systems. Finally, we show positive results: it is possible to
construct a parallel distributed transactional system if any one of the
properties we study is removed.
ORPE -- A Data Semantics Driven Concurrency Control
This paper presents a concurrency control mechanism that does not follow a
'one concurrency control mechanism fits all needs' strategy. With the presented
mechanism a transaction runs under several concurrency control mechanisms and
the appropriate one is chosen based on the accessed data. For this purpose, the
data is divided into four classes based on its access type and usage
(semantics). Class O (the optimistic class) implements a first-committer-wins
strategy, class R (the reconciliation class) implements a
first-n-committers-win strategy, class P (the pessimistic class) implements a
first-reader-wins strategy, and class E (the escrow class) implements a
first-n-readers-win strategy. Accordingly, the model is called ORPE. The
selected concurrency control mechanism may be automatically adapted at run-time
according to the current load or a known usage profile. This run-time
adaptation allows ORPE to balance the commit rate and the response time even
under changing conditions. ORPE outperforms Snapshot Isolation concurrency
control in terms of response time by a factor of approximately 4.5 under heavy
transactional load (4000 concurrent transactions). As a consequence, the degree
of concurrency is 3.2 times higher. (20 pages, 7 tables, 15 figures.)
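The optimistic class's first-committer-wins rule can be sketched as a toy validation step (an assumed simplification; the data structures and names are illustrative, not ORPE's implementation):

```python
# First-committer-wins validation: at commit time, a transaction
# aborts if any item it accessed was committed by another
# transaction after this one started; otherwise it installs its
# writes with its commit timestamp.

committed_version = {}   # item -> timestamp of its last commit

def try_commit(read_set, write_set, start_ts, commit_ts):
    for item in read_set | write_set:
        if committed_version.get(item, 0) > start_ts:
            return False                      # lost the race: abort
    for item in write_set:
        committed_version[item] = commit_ts   # first committer wins
    return True

# Two transactions race on item x; both started at ts=1.
ok1 = try_commit({"x"}, {"x"}, start_ts=1, commit_ts=2)  # first: commits
ok2 = try_commit({"x"}, {"x"}, start_ts=1, commit_ts=3)  # second: aborts
```

The other three classes vary this decision point: reconciliation tolerates up to n committers, while the pessimistic and escrow classes resolve the conflict at read time instead of commit time.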
Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLxP Workloads
We present an overview of our work on the SAP HANA Scale-out Extension, a novel distributed database architecture designed to support large-scale analytics over real-time data. This platform permits high-performance OLAP with massive scale-out capabilities, while concurrently allowing OLTP workloads. This dual capability enables analytics over real-time changing data and allows fine-grained user-specified service level agreements (SLAs) on data freshness. We advocate the decoupling of core database components such as query processing, concurrency control, and persistence, a design choice made possible by advances in high-throughput low-latency networks and storage devices. We provide full ACID guarantees and build on a logical timestamp mechanism to provide MVCC-based snapshot isolation, while not requiring synchronous updates of replicas. Instead, we use asynchronous update propagation, guaranteeing consistency with timestamp validation. We provide a view into the design and development of a large-scale data management platform for real-time analytics, driven by the needs of modern enterprise customers.
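The logical-timestamp visibility rule behind MVCC snapshot isolation can be sketched as follows (a hedged simplification, not SAP HANA's actual implementation): because visibility depends only on logical commit timestamps, replicas can apply updates asynchronously without changing what any snapshot observes.

```python
# MVCC snapshot visibility with logical timestamps: a reader with
# snapshot timestamp S sees, for each item, the newest version whose
# commit timestamp is <= S, regardless of when a replica applied it.

def visible(versions, snapshot_ts):
    """versions: list of (commit_ts, value) for one item."""
    candidates = [(ts, v) for ts, v in versions if ts <= snapshot_ts]
    return max(candidates)[1] if candidates else None

# Three committed versions of one item, at logical times 10, 20, 30.
versions = [(10, "a"), (20, "b"), (30, "c")]
# A snapshot taken at logical time 25 sees "b" on every replica,
# even a replica that has not yet applied the ts=30 update.
```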
Modular Collaborative Program Analysis
With our world increasingly relying on computers, it is important to ensure the quality, correctness, security, and performance of software systems. Static analysis, which computes properties of computer programs without executing them, has been an important method to achieve this for decades. However, static analysis faces major challenges from increasingly complex programming languages and software systems and from increasing, sometimes conflicting, demands for soundness, precision, and scalability. In order to cope with these challenges, it is necessary to build static analyses for complex problems from small, independent, yet collaborating modules that can be developed in isolation and combined in a plug-and-play manner.
So far, no generic architecture exists to implement and combine a broad range of dissimilar static analyses. The goal of this thesis is thus to design such an architecture and implement it as a generic framework for developing modular, collaborative static analyses. We use several diverse case-study analyses from which we systematically derive requirements to guide the design of the framework. Based on this, we propose a blackboard-architecture-style collaboration of analyses, which we implement in the OPAL framework. We also develop a formal model of our architecture's core concepts and show how it enables freely composing analyses while retaining their soundness guarantees.
We showcase and evaluate our architecture using the case-study analyses, each of which shows how important and complex problems of static analysis can be addressed using a modular, collaborative implementation style. In particular, we show how a modular architecture for the construction of call graphs ensures consistent soundness of different algorithms. We show how modular analyses for different aspects of immutability mutually benefit each other. Finally, we show how the analysis of method purity can benefit from the use of other complex analyses in a collaborative manner and from exchanging different analysis implementations that exhibit different characteristics. Each of these case studies improves over the respective state of the art in terms of soundness, precision, and/or scalability, and shows how our architecture enables experimenting with and fine-tuning trade-offs between these qualities.
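The blackboard-style collaboration can be sketched in a few lines (a toy fixpoint loop under assumed semantics; the analyses, fact names, and driver are hypothetical, not OPAL's API):

```python
# Blackboard collaboration: independent analyses post partial results
# (facts) to a shared store; the driver re-runs them until no analysis
# produces a change, i.e., a fixpoint is reached.

def solve(analyses, facts):
    """analyses: list of functions mapping facts -> dict of updates."""
    changed = True
    while changed:
        changed = False
        for analysis in analyses:
            for key, value in analysis(facts).items():
                if facts.get(key) != value:
                    facts[key] = value
                    changed = True
    return facts

# Two collaborating analyses: one establishes a field's immutability,
# the other derives method purity from that immutability fact.
def immutability(facts):
    return {"field_f_immutable": True}

def purity(facts):
    if facts.get("field_f_immutable"):
        return {"method_m_pure": True}
    return {}

# Even listed in the "wrong" order, the fixpoint loop lets purity
# pick up immutability's result on the next round.
result = solve([purity, immutability], {})
```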
Efficient Geo-Distributed Transaction Processing
Distributed deterministic database systems support OLTP workloads over geo-replicated data. Providing these transactions with ACID guarantees requires a delay of multiple wide-area network (WAN) round trips of messaging to totally order transactions globally. This thesis presents Sloth, a geo-replicated database system that can serializably commit transactions after a delay of only a single WAN round trip of messaging. Sloth reduces the cost of determining the total global order for all transactions by leveraging deterministic merging of partial sequences of transactions per geographic region. Using popular workload benchmarks over geo-replicated Azure, this thesis shows that Sloth outperforms state-of-the-art comparison systems, delivering low-latency transaction execution.
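The deterministic-merging idea can be sketched as follows (an illustration only; the round-robin rule and region names are assumptions, not Sloth's exact algorithm). Because every replica applies the same fixed rule to the same per-region sequences, all replicas derive an identical total order without further coordination:

```python
# Deterministically merge per-region partial transaction sequences
# into one global total order using a fixed rule (sorted region
# names, round-robin). Any replica running this on the same inputs
# produces the same order.

def deterministic_merge(regional_logs):
    """regional_logs: dict region -> list of txn ids, already ordered
    within each region."""
    order = []
    cursors = {region: 0 for region in regional_logs}
    remaining = sum(len(log) for log in regional_logs.values())
    while remaining:
        for region in sorted(regional_logs):   # fixed, global rule
            i = cursors[region]
            if i < len(regional_logs[region]):
                order.append(regional_logs[region][i])
                cursors[region] = i + 1
                remaining -= 1
    return order

logs = {"us": ["u1", "u2"], "eu": ["e1"], "asia": ["a1", "a2"]}
total = deterministic_merge(logs)
```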