98 research outputs found
Open Programming Language Interpreters
Context: This paper presents the concept of open programming language
interpreters and the implementation of a framework-level metaobject protocol
(MOP) to support them. Inquiry: We address the problem of dynamic interpreter
adaptation to tailor the interpreter's behavior on the task to be solved and to
introduce new features to fulfill unforeseen requirements. Many languages
provide a MOP that to some degree supports reflection. However, MOPs are
typically language-specific, their reflective functionality is often
restricted, and the adaptation and application logic are often mixed which
hardens the understanding and maintenance of the source code. Our system
overcomes these limitations. Approach: We designed and implemented a system to
support open programming language interpreters. The prototype implementation is
integrated in the Neverlang framework. The system exposes the structure,
behavior and the runtime state of any Neverlang-based interpreter with the
ability to modify it. Knowledge: Our system provides a complete control over
interpreter's structure, behavior and its runtime state. The approach is
applicable to every Neverlang-based interpreter. Adaptation code can
potentially be reused across different language implementations. Grounding:
Having a prototype implementation we focused on feasibility evaluation. The
paper shows that our approach well addresses problems commonly found in the
research literature. We have a demonstrative video and examples that illustrate
our approach on dynamic software adaptation, aspect-oriented programming,
debugging and context-aware interpreters. Importance: To our knowledge, our
paper presents the first reflective approach targeting a general framework for
language development. Our system provides full reflective support for free to
any Neverlang-based interpreter. We are not aware of any prior application of
open implementations to programming language interpreters in the sense defined
in this paper. Rather than substituting other approaches, we believe our system
can be used as a complementary technique in situations where other approaches
present serious limitations
TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions.
Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have
severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for
implementing recovery, complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop-
ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism,
lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance.
We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where
compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level
ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating
any intervention from the programmers
Leveraging Non-Volatile Memory in Modern Storage Management Architectures
Non-volatile memory technologies (NVM) introduce a novel class of devices that combine characteristics of both storage and main memory. Like storage, NVM is not only persistent, but also denser and cheaper than DRAM. Like DRAM, NVM is byte-addressable and has lower access latency. In recent years, NVM has gained a lot of attention both in academia and in the data management industry, with views ranging from skepticism to over excitement. Some critics claim that NVM is not cheap enough to replace flash-based SSDs nor is it fast enough to replace DRAM, while others see it simply as a storage device. Supporters of NVM have observed that its low latency and byte-addressability requires radical changes and a complete rewrite of storage management architectures.
This thesis takes a moderate stance between these two views. We consider that, while NVM might not replace flash-based SSD or DRAM in the near future, it has the potential to reduce the gap between them. Furthermore, treating NVM as a regular storage media does not fully leverage its byte-addressability and low latency. On the other hand, completely redesigning systems to be NVM-centric is impractical. Proposals that attempt to leverage NVM to simplify storage management result in completely new architectures that face the same challenges that are already well-understood and addressed by the traditional architectures. Therefore, we take three common storage management architectures as a starting point, and propose incremental changes to enable them to better leverage NVM. First, in the context of log-structured merge-trees, we investigate the impact of storing data in NVM, and devise methods to enable small granularity accesses and NVM-aware caching policies. Second, in the context of B+Trees, we propose to extend the buffer pool and describe a technique based on the concept of optimistic consistency to handle corrupted pages in NVM. Third, we employ NVM to enable larger capacity and reduced costs in a index+log key-value store, and combine it with other techniques to build a system that achieves low tail latency. This thesis aims to describe and evaluate these techniques in order to enable storage management architectures to leverage NVM and achieve increased performance and lower costs, without major architectural changes.:1 Introduction
1.1 Non-Volatile Memory
1.2 Challenges
1.3 Non-Volatile Memory & Database Systems
1.4 Contributions and Outline
2 Background
2.1 Non-Volatile Memory
2.1.1 Types of NVM
2.1.2 Access Modes
2.1.3 Byte-addressability and Persistency
2.1.4 Performance
2.2 Related Work
2.3 Case Study: Persistent Tree Structures
2.3.1 Persistent Trees
2.3.2 Evaluation
3 Log-Structured Merge-Trees
3.1 LSM and NVM
3.2 LSM Architecture
3.2.1 LevelDB
3.3 Persistent Memory Environment
3.4 2Q Cache Policy for NVM
3.5 Evaluation
3.5.1 Write Performance
3.5.2 Read Performance
3.5.3 Mixed Workloads
3.6 Additional Case Study: RocksDB
3.6.1 Evaluation
4 B+Trees
4.1 B+Tree and NVM
4.1.1 Category #1: Buffer Extension
4.1.2 Category #2: DRAM Buffered Access
4.1.3 Category #3: Persistent Trees
4.2 Persistent Buffer Pool with Optimistic Consistency
4.2.1 Architecture and Assumptions
4.2.2 Embracing Corruption
4.3 Detecting Corruption
4.3.1 Embracing Corruption
4.4 Repairing Corruptions
4.5 Performance Evaluation and Expectations
4.5.1 Checksums Overhead
4.5.2 Runtime and Recovery
4.6 Discussion
5 Index+Log Key-Value Stores
5.1 The Case for Tail Latency
5.2 Goals and Overview
5.3 Execution Model
5.3.1 Reactive Systems and Actor Model
5.3.2 Message-Passing Communication
5.3.3 Cooperative Multitasking
5.4 Log-Structured Storage
5.5 Networking
5.6 Implementation Details
5.6.1 NVM Allocation on RStore
5.6.2 Log-Structured Storage and Indexing
5.6.3 Garbage Collection
5.6.4 Logging and Recovery
5.7 Systems Operations
5.8 Evaluation
5.8.1 Methodology
5.8.2 Environment
5.8.3 Other Systems
5.8.4 Throughput Scalability
5.8.5 Tail Latency
5.8.6 Scans
5.8.7 Memory Consumption
5.9 Related Work
6 Conclusion
Bibliography
A PiBenc
Recommended from our members
Providing Easy to Use and Fast Programming Support for Non-Volatile Memories
Non-Volatile Memory (NVM) technologies, such as 3D XPoint, offer DRAM-like performance and byte-addressable access to persistent data. NVMs promise an opportunity for fast, persistent data structures, and a wide range of applications stand to benefit from the performance potential of these technologies. These potential benefits are greatest when applications access NVM directly via load/store instructions rather than conventional file-based interfaces. Directly accessing NVM presents several challenges. In particular, applications need guaranteed consistency and safety semantics to protect their data structures in the face of system failures and programming errors.Implementing data structures that meet these requirements is challenging and error-prone. Existing methods for building persistent data structures require either in-depth code changes to an existing data structure or rewriting the data structure from scratch. Unfortunately, both of these methods are labor-intensive and error-prone.Failure-atomicity libraries and programming language extensions can simplify this task. However, all the proposed solutions either require pervasive changes to existing software or incur unacceptable overheads to runtime performance. As a result, porting legacy applications to leverage NVM is likely to be prohibitively difficult and time-consuming.This dissertation first presents Breeze, an NVM toolchain that minimizes the changes necessary to enable legacy code to reap the benefits of directly accessing NVM. In contrast to PMDK and NVM-Direct, Breeze reduces the programming effort of porting Memcached and MongoDB by up to 2.8×, while providing equal or superior performance.Second, it introduces NVHooks, a compiler that automatically annotates NVM accesses and avoids disruptive and error-prone changes to programs. NVHooks reduces the cost of these annotations by applying novel, NVM-specific optimizations to their placement. For our tested benchmarks, NVHooks matches the performance of hand-annotated code while minimizing programmer effort.Finally, it presents Pronto, a new NVM library that reduces the programming effort required to add persistence to volatile data structures. Pronto uses asynchronous semantic logging (ASL) to allow adding persistence to the existing volatile data structure (e.g., C++ Standard Template Library containers) with minor programming effort. ASL moves most durability code off the critical path. Our evaluation shows Pronto data structures outperform highly-optimized NVM data structures by a large margin
Architectural Principles for Database Systems on Storage-Class Memory
Database systems have long been optimized to hide the higher latency of storage media, yielding complex persistence mechanisms. With the advent of large DRAM capacities, it became possible to keep a full copy of the data in DRAM. Systems that leverage this possibility, such as main-memory databases, keep two copies of the data in two different formats: one in main memory and the other one in storage. The two copies are kept synchronized using snapshotting and logging. This main-memory-centric architecture yields nearly two orders of magnitude faster analytical processing than traditional, disk-centric ones. The rise of Big Data emphasized the importance of such systems with an ever-increasing need for more main memory. However, DRAM is hitting its scalability limits: It is intrinsically hard to further increase its density.
Storage-Class Memory (SCM) is a group of novel memory technologies that promise to alleviate DRAM’s scalability limits. They combine the non-volatility, density, and economic characteristics of storage media with the byte-addressability and a latency close to that of DRAM. Therefore, SCM can serve as persistent main memory, thereby bridging the gap between main memory and storage. In this dissertation, we explore the impact of SCM as persistent main memory on database systems. Assuming a hybrid SCM-DRAM hardware architecture, we propose a novel software architecture for database systems that places primary data in SCM and directly operates on it, eliminating the need for explicit IO. This architecture yields many benefits: First, it obviates the need to reload data from storage to main memory during recovery, as data is discovered and accessed directly in SCM. Second, it allows replacing the traditional logging infrastructure by fine-grained, cheap micro-logging at data-structure level. Third, secondary data can be stored in DRAM and reconstructed during recovery. Fourth, system runtime information can be stored in SCM to improve recovery time. Finally, the system may retain and continue in-flight transactions in case of system failures.
However, SCM is no panacea as it raises unprecedented programming challenges. Given its byte-addressability and low latency, processors can access, read, modify, and persist data in SCM using load/store instructions at a CPU cache line granularity. The path from CPU registers to SCM is long and mostly volatile, including store buffers and CPU caches, leaving the programmer with little control over when data is persisted. Therefore, there is a need to enforce the order and durability of SCM writes using persistence primitives, such as cache line flushing instructions. This in turn creates new failure scenarios, such as missing or misplaced persistence primitives.
We devise several building blocks to overcome these challenges. First, we identify the programming challenges of SCM and present a sound programming model that solves them. Then, we tackle memory management, as the first required building block to build a database system, by designing a highly scalable SCM allocator, named PAllocator, that fulfills the versatile needs of database systems. Thereafter, we propose the FPTree, a highly scalable hybrid SCM-DRAM persistent B+-Tree that bridges the gap between the performance of transient and persistent B+-Trees. Using these building blocks, we realize our envisioned database architecture in SOFORT, a hybrid SCM-DRAM columnar transactional engine. We propose an SCM-optimized MVCC scheme that eliminates write-ahead logging from the critical path of transactions. Since SCM -resident data is near-instantly available upon recovery, the new recovery bottleneck is rebuilding DRAM-based data. To alleviate this bottleneck, we propose a novel recovery technique that achieves nearly instant responsiveness of the database by accepting queries right after recovering SCM -based data, while rebuilding DRAM -based data in the background. Additionally, SCM brings new failure scenarios that existing testing tools cannot detect. Hence, we propose an online testing framework that is able to automatically simulate power failures and detect missing or misplaced persistence primitives. Finally, our proposed building blocks can serve to build more complex systems, paving the way for future database systems on SCM
Recommended from our members
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers.Unfortunately, existing approaches of providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. Such assumption no longer holds for the fast NVMM. As a result, taking full advantage of NVMMs’ potential will require changes in system software and networking protocol. This thesis focuses on accessing remote NVMM efficiently using remote direct memory access (RDMA) network. RDMA enables a client to directly access memory on a remote machine without involving its local CPU.This thesis first presents Mojim, a system that provides replicated, reliable, and highly-available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.This thesis then presents Orion, a distributed file system designed from for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives. These slower media incentivizes complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer. It provides low latency metadata accesses and outperforms existing distributed file systems by a large margin.Finally, an NVMM application can map files backed by an NVMM file system into its address space, and accesses them using CPU instructions. In this case, RDMA and NVMM file systems introduce duplication of effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC’s translation cache and resulting in application performance improvement by 1.8x - 2.0x
A Fully Userspace Remote Storage Access Stack
As computer networking has evolved and the available throughput has increased, the efficiency of the network software stack has become increasingly important. This is because the latency introduced by software has gone from insignificant, compared to historically poor network performance, to the largest component of latency for a modern local-area network. Currently, the vast majority of code that accesses the hardware is part of the kernel, because the kernel is responsible for ensuring that user applications do not interfere with each other when accessing the hardware. Remote Direct Memory Access~(RDMA) provides a solution for applications to perform direct data transfers over the network without requiring context switches into the kernel, but relies instead on specialized hardware interfaces to handle the virtual address mappings and transport protocols. This more intelligent hardware allows for direct control from the userspace application, eliminating the cost of context switches into the kernel. This in turn reduces the overall latency of message transfers.
Just like networking, storage is currently undergoing a similar evolution. For most of the recent history of computing, the most common durable storage mechanism has been mechanical hard disk drives, which can only be accessed at block level and have high latency compared to the software drivers used to access the data. However, the introduction of solid state disks~(SSDs) based on Flash significantly decreased the latency, as there are no mechanical parts that need to move to access the data. Upcoming non-volatile memory solutions reduce this latency even further, and even allow byte-level access to the storage medium. Thus, just like with networking, software drivers become the bottleneck and we look for solutions to bypass the kernel to improve the efficiency of direct userspace access to storage.
This thesis offers two contributions as part of a solution to these problems. The first part introduces urdma, a software RDMA driver which leverages the Data Plane Development Kit (DPDK) to perform network data transfers in userspace without specialized RDMA interface hardware. The second part examines remote locking protocols, which are required for synchronization in distributed storage systems. We define an RDMA locking mechanism referred to as Verbs Offload Locking Technology (VOLT), which allows acquisition of a remote lock object without any CPU usage by the target node. This offloading allows VOLT to be used with disaggregated memory servers that have limited onboard CPU resources, while also lowering the application overhead for remote locking. Finally, we define a bytecode framework using enhanced Berkeley Packet Filter (eBPF) bytecode for extending the capabilities of an RDMA-capable network interface card (NIC) with new operations, and show how this can be used to implement our remote locking operation
Recommended from our members
A Paradigm for Scalable, Transactional, and Efficient Spatial Indexes
With large volumes of geo-tagged data collected in various applications, spatial query pro- cessing becomes essential. Query engines depend on efficient indexes to expedite processing. There are three main challenges: scaling out to accommodate large volumes of spatial data, support- ing transactional primitives for strong consistency guarantees, and adapting to highly dynamic workloads. This thesis proposes a paradigm for scalable, transactional, and efficient spatial indexes to significantly reduce development efforts in designing and comparing multiple spatial indexes.This thesis first introduces a distributed and transactional key value store called DTranx to persist the spatial indexes. DTranx follows the SEDA architecture to exploit high concurrency in multi-core environments and it adopts a hybrid of optimistic concurrency control and two-phase commit protocols to narrow down the critical sections of distributed locking during transaction com- mits. Moreover, DTranx integrates a persistent memory based write-ahead log to reduce durability overhead and combines a garbage collection mechanism without affecting normal transactions. To maintain high throughput for search workloads when databases are constantly updated, snapshot transactions are introduced.Then, a paradigm is presented with a set of intuitive APIs and a Mempool runtime to re- duce development efforts. Mempool transparently synchronizes local states of data structures with DTranx and it handles two critical tasks: address translation and transparent server synchroniza- tion, of which the latter includes transaction construction and data synchronization. Furthermore, a dynamic partitioning strategy is integrated into DTranx to generate partitioning and replication plans that reduce inter-server communications and balance resource usage.Lastly, single-threaded data structures BTree and RTree are converted into distributed versions within two weeks. The BTree and RTree achieve 253.07 kops/sec and 77.83 kops/sec through- put respectively for pure search operations in a 25-server cluster
- …