Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency
Persistent memory provides high-performance data persistence at main memory.
Memory writes need to be performed in strict order to satisfy storage
consistency requirements and enable correct recovery from system crashes.
Unfortunately, adhering to such a strict order significantly degrades system
performance and persistent memory endurance. This paper introduces a new
mechanism, Loose-Ordering Consistency (LOC), that satisfies the ordering
requirements at significantly lower performance and endurance loss. LOC
consists of two key techniques. First, Eager Commit eliminates the need to
perform a persistent commit record write within a transaction. We do so by
ensuring that we can determine the status of all committed transactions during
recovery by storing necessary metadata information statically with blocks of
data written to memory. Second, Speculative Persistence relaxes the write
ordering between transactions by allowing writes to be speculatively written to
persistent memory. A speculative write is made visible to software only after
its associated transaction commits. To enable this, our mechanism supports the
tracking of committed transaction ID and multi-versioning in the CPU cache. Our
evaluations show that LOC reduces the average performance overhead of memory
persistence from 66.9% to 34.9% and the memory write traffic overhead from
17.1% to 3.4% on a variety of workloads.
Comment: This paper has been accepted by IEEE Transactions on Parallel and Distributed Systems.
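The strict-ordering baseline that LOC improves upon can be sketched in C. This is a simplified illustration, not the paper's implementation: `persist()` here is a stand-in that on real hardware would be a cache-line writeback (e.g. clwb/clflushopt) followed by a store fence, and the transaction layout is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for making a range durable; simulation only. On real
   persistent-memory hardware this would flush the cache line(s) and
   issue an sfence to enforce persist ordering. */
static void persist(const void *addr, size_t len) {
    (void)addr; (void)len;
    __asm__ volatile("" ::: "memory");  /* compiler ordering barrier */
}

struct tx_log {
    uint64_t data[4];   /* the transaction's data writes */
    uint64_t committed; /* commit record, written strictly after data */
};

/* Baseline strict ordering: every data write is made durable BEFORE the
   commit record is written. These ordering points are the stalls that
   LOC's Eager Commit removes by storing commit metadata statically with
   the data blocks, and that Speculative Persistence relaxes across
   transactions. */
void durable_commit(struct tx_log *t, const uint64_t *vals, int n) {
    for (int i = 0; i < n; i++) {
        t->data[i] = vals[i];
        persist(&t->data[i], sizeof t->data[i]); /* ordered persist */
    }
    t->committed = 1;                 /* only now is the tx durable */
    persist(&t->committed, sizeof t->committed);
}
```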
Compiler-Driven Software Speculation for Thread-Level Parallelism
Current parallelizing compilers can tackle applications exercising regular access patterns on arrays or affine indices, where data dependencies can be expressed in a linear form. Unfortunately, there are cases where independence between statements of code cannot be guaranteed, and thus the compiler conservatively produces sequential code. Programs that involve extensive pointer use, irregular access patterns, and loops with an unknown number of iterations are examples of such cases. This limits the extraction of parallelism in cases where dependencies are rarely or never triggered at runtime. Speculative parallelism refers to methods employed during program execution that aim to produce a valid parallel execution schedule for programs immune to static parallelization. The motivation for this article is to review recent developments in the area of compiler-driven software speculation for thread-level parallelism and how they came about. The article is divided into two parts. In the first part, the fundamentals of speculative parallelization for thread-level parallelism are explained, along with a categorization of the design choices involved in implementing such systems. Design choices include the way speculative data is handled, how data dependence violations are detected and resolved, how the correct data are made visible to other threads, and how speculative threads are scheduled. The second part is structured around those design choices, surveying the advances and trends in the literature with reference to key developments in the area. Although the focus of the article is on software speculative parallelization, a section is dedicated to providing the interested reader with pointers and references for exploring similar topics such as hardware thread-level speculation, transactional memory, and automatic parallelization.
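The design choices the survey categorizes can be made concrete with a minimal sketch of lazy software speculation. Everything here is illustrative (the bitmap access sets, the in-order commit check): each speculative task buffers its writes privately, and a task commits only if its read set does not overlap a predecessor's write set.

```c
#include <stdbool.h>

#define N 8  /* toy shared-memory size, for illustration */

typedef struct {
    bool read[N], write[N];   /* per-task access bitmaps */
    int  buf[N];              /* private speculative write buffer */
} spec_task;

/* Dependence-violation check: true if `succ` read a location that its
   logical predecessor `pred` wrote (a flow dependence was violated, so
   succ must re-execute). */
bool conflicts(const spec_task *pred, const spec_task *succ) {
    for (int i = 0; i < N; i++)
        if (pred->write[i] && succ->read[i]) return true;
    return false;
}

/* Lazy commit: buffered writes become visible to other threads only
   after the task passes its conflict check, in original program order. */
void commit(spec_task *t, int *shared) {
    for (int i = 0; i < N; i++)
        if (t->write[i]) shared[i] = t->buf[i];
}
```

A real system would also handle rollback, in-place (eager) versioning as an alternative to buffering, and scheduling of speculative threads, which are exactly the design axes the article surveys.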
Efficient, scalable, and fair read-modify-writes
Read-Modify-Write (RMW) operations, or atomics, have widespread application in
(a) synchronization, where they are used as building blocks of various synchronization
constructs like locks, barriers, and lock-free data structures (b) supervised memory systems,
where every memory operation is effectively an RMW that reads and modifies
metadata associated with memory addresses and (c) profiling, where RMW instructions
are used to increment shared counters to convey meaningful statistics about a
program. In each of these scenarios, the RMWs pose a bottleneck to performance and
scalability. We observed that the cost of RMWs depends on two major factors:
the memory ordering enforced by the RMW, and contention amongst processors performing
RMWs to the same memory address. In both synchronization and
supervised memory systems, RMWs are expensive because of the memory ordering
enforced by the atomic RMW operation. Performance overhead due to contention
is more prevalent in parallel programs which frequently make use of RMWs to update
concurrent data structures in a non-blocking manner. Such programs also suffer from a
degradation in fairness amongst concurrent processors. In this thesis, we study the cost
of RMWs in the above applications, and present solutions to obtain better performance
and scalability from RMW operations.
Firstly, this thesis tackles the large overhead of RMW instructions when used for
synchronization in the widely used x86 processor architecture, as implemented in Intel, AMD, and
Sun processors. The x86 processor architecture implements a variation of the Total-Store-Order (TSO) memory consistency model. RMW instructions in existing TSO architectures
(we call them type-1 RMW) are ordered like memory fences, which makes
them expensive. The strong fence-like ordering of type-1 RMWs is unnecessary for the
memory ordering required by synchronization. We propose weaker RMW instructions
for TSO consistency; we consider two weaker definitions: type-2 and type-3, each
causing subtle ordering differences. Type-2 and type-3 RMWs avoid the fence-like
ordering of type-1 RMWs, thereby reducing their overhead. Recent work has shown
that the new C/C++11 memory consistency model can be realized by generating type-1 RMWs for SC-atomic-writes and/or SC-atomic-reads. We formally prove that this
is equally valid for the proposed type-2 RMWs, and partially for type-3 RMWs. We
also propose efficient implementations for type-2 (type-3) RMWs. Simulation results
show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%),
which translates into an overall performance improvement of up to 9.0% (9.2%) for
the programs considered.
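At the language level, the distinction between fence-like and weaker RMWs corresponds to C11 memory-order arguments, as the sketch below illustrates. Note the hedge: on today's x86 compilers both functions typically emit the same lock-prefixed instruction, because existing hardware only provides the type-1 (fence-like) RMW; the thesis's point is that the architecture could implement the weaker ordering that acquire-release semantics actually require.

```c
#include <stdatomic.h>
#include <stdint.h>

/* "Type-1"-style RMW: sequentially consistent, ordered like a full
   memory fence (on x86, a lock-prefixed instruction that drains the
   store buffer before completing). */
uint64_t rmw_seq_cst(_Atomic uint64_t *ctr) {
    return atomic_fetch_add_explicit(ctr, 1, memory_order_seq_cst);
}

/* Weaker RMW in the spirit of the proposed type-2/type-3 definitions:
   acquire-release ordering is all that lock acquire/release and similar
   synchronization idioms need, so the fence-like cost is avoidable. */
uint64_t rmw_acq_rel(_Atomic uint64_t *ctr) {
    return atomic_fetch_add_explicit(ctr, 1, memory_order_acq_rel);
}
```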
Next, we argue the case for an efficient and correct supervised memory system
for the TSO memory consistency model. Supervised memory systems make use of
RMW-like supervised memory instructions (SMIs) to atomically update metadata associated
with every memory address used by an application program. Such a system is
used to help increase reliability, security and accuracy of parallel programs by offering
debugging/monitoring features. Most existing supervised memory systems assume a
sequentially consistent memory. For weaker consistency models, like TSO, correctness
issues (like imprecise exceptions) arise if the ordering requirement of SMIs is
neglected. In this thesis, we show that it is sufficient for supervised instructions to only
read and process their metadata in order to ensure correctness. We propose SuperCoP,
a supervised memory system for relaxed memory models in which SMIs read and process
metadata before retirement, while allowing data and metadata writes to retire into
the write-buffer. Our experimental results show that SuperCoP performs better than
the existing state-of-the-art correct supervision system by 16.8%.
Finally, we address the issue of contention and contention-based failure of RMWs
in non-blocking synchronization mechanisms. We leverage the fact that most existing
lock-free programs make use of compare-and-swap (CAS) loops to access the
concurrent data structure. We propose DyFCoM (Dynamic Fairness and Contention
Management), a holistic scheme which addresses both throughput and fairness under
increased contention. DyFCoM monitors the number of successful and failed RMWs
in each thread, and uses this information to implement a dynamic backoff scheme to
optimize throughput. We also use this information to throttle faster threads and give
slower threads a higher chance of performing their lock-free operations, to increase
fairness among threads. Our experimental results show that our contention management
scheme alone performs better than the existing state-of-the-art CAS contention
management scheme by an average of 7.9%. When fairness management is included,
our scheme provides an average of 3.4% performance improvement over the constant
backoff scheme, while showing increased fairness values in all cases (up to 43.6%).
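The kind of CAS loop DyFCoM targets, with a failure-driven dynamic backoff, can be sketched as follows. The structure and names are hypothetical, not taken from the thesis, and the fairness-throttling half of DyFCoM is omitted; this shows only the contention-management idea of growing the backoff window on failed CASes and shrinking it on success.

```c
#include <stdatomic.h>

/* Per-thread statistics driving the adaptive backoff (illustrative). */
typedef struct { unsigned fails, succ, backoff; } cas_stats;

static void pause_for(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { /* spin-wait */ }
}

/* Lock-free fetch-and-add via a CAS retry loop. On failure the backoff
   window doubles (reducing contention on the shared cache line); on
   success it halves, so lightly contended phases stay responsive. */
long fetch_add_backoff(_Atomic long *loc, long delta, cas_stats *st) {
    for (;;) {
        long old = atomic_load_explicit(loc, memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                loc, &old, old + delta,
                memory_order_acq_rel, memory_order_relaxed)) {
            st->succ++;
            if (st->backoff > 1) st->backoff /= 2;   /* shrink window */
            return old;
        }
        st->fails++;
        st->backoff = st->backoff ? st->backoff * 2 : 1; /* grow window */
        pause_for(st->backoff);
    }
}
```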
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
Instruction-grain lifeguards monitor the events of a running application at the level of individual instructions in order to identify and help mitigate application bugs and security exploits. Because such lifeguards impose a 10-100X slowdown on existing platforms, previous studies have proposed hardware designs to accelerate lifeguard processing. However, these accelerators are either tailored to a specific class of lifeguards or suitable only for monitoring single-threaded programs. We present ParaLog, the first design of a system enabling fast online parallel monitoring of multithreaded parallel applications. ParaLog supports a broad class of software-defined lifeguards. We show how three existing accelerators can be enhanced to support online multithreaded monitoring, dramatically reducing lifeguard overheads. We identify and solve several challenges in monitoring parallel applications and/or parallelizing these accelerators, including (i) enforcing inter-thread data dependences, (ii) dealing with inter-thread effects that are not reflected in coherence traffic, (iii) dealing with unmonitored operating system activity, and (iv) ensuring lifeguards can access shared metadata with negligible synchronization overheads. We present our system design for both Sequentially Consistent and Total Store Ordering processors. We implement and evaluate our design on a 16 core simulated CMP, using benchmarks from SPLASH-2 and PARSEC and two lifeguards: a data-flow tracking lifeguard and a memory-access checker lifeguard. Our results show that (i) our parallel accelerators improve performance by 2-9X and 1.13-3.4X for our two lifeguards, respectively, (ii) we are 5-126X faster than the time-slicing approach required by existing techniques, and (iii) our average overheads for applications with eight threads are 51% and 28% for the two lifeguards, respectively.
Efficient tagged memory
We characterize the cache behavior of an in-memory tag table and
demonstrate that an optimized implementation can typically achieve a near-zero memory traffic overhead. Both industry and academia have repeatedly demonstrated tagged memory as a key mechanism to enable enforcement of powerful security invariants, including capabilities, pointer integrity, watchpoints, and information-flow tracking. A single-bit tag shadowspace is the most commonly proposed requirement, as one bit is the minimum metadata needed to distinguish between an untyped data word and any number of new hardware-enforced types. We survey various tag shadowspace approaches and identify their common requirements and the positive features of their implementations. To avoid non-standard memory widths, we identify the most practical implementation for tag storage to be an in-memory table managed next to the DRAM controller. We characterize the caching performance of such a tag table and demonstrate a DRAM traffic overhead below 5% for the vast majority of applications. We identify spatial locality on a page scale as the primary factor that enables surprisingly high table cacheability. We then demonstrate tag-table compression for a set of common applications. A hierarchical structure with elegantly simple optimizations reduces DRAM traffic overhead to below 1% for most applications. These insights and optimizations pave the way for commercial applications making use of single-bit tags stored in commodity memory.
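The addressing arithmetic for a single-bit-per-word tag table is simple enough to sketch. This layout (one tag bit per 64-bit word, so the table is 1/64 the size of tagged memory) is a plausible flat organization, not the hierarchical compressed structure the paper ultimately evaluates.

```c
#include <stdint.h>

enum { WORD_BYTES = 8, BITS_PER_BYTE = 8 };

/* Byte offset into the flat tag table for a given physical address:
   one bit per 8-byte word means dividing by 64. */
uint64_t tag_byte(uint64_t paddr) {
    return paddr / (WORD_BYTES * BITS_PER_BYTE);
}

/* Bit position of that word's tag within the table byte. */
unsigned tag_bit(uint64_t paddr) {
    return (unsigned)((paddr / WORD_BYTES) % BITS_PER_BYTE);
}

int get_tag(const uint8_t *table, uint64_t paddr) {
    return (table[tag_byte(paddr)] >> tag_bit(paddr)) & 1;
}

void set_tag(uint8_t *table, uint64_t paddr, int v) {
    uint8_t m = (uint8_t)(1u << tag_bit(paddr));
    if (v) table[tag_byte(paddr)] |= m;
    else   table[tag_byte(paddr)] &= (uint8_t)~m;
}
```

Because adjacent words map to adjacent table bits, one cached table line covers a large span of data addresses, which is the page-scale spatial locality the paper identifies as the source of high table cacheability.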
Mechanisms for Unbounded, Conflict-Robust Hardware Transactional Memory
Conventional lock implementations serialize access to critical sections guarded by the same lock, presenting programmers with a difficult tradeoff between granularity of synchronization and amount of parallelism realized. Recently, researchers have been investigating an emerging synchronization mechanism called transactional memory as an alternative to such conventional lock-based synchronization. Memory transactions have the semantics of executing in isolation from one another while in reality executing speculatively in parallel, aborting when necessary to maintain the appearance of isolation. This combination of coarse-grained isolation and optimistic parallelism has the potential to ease the tradeoff presented by lock-based programming.
This dissertation studies the hardware implementation of transactional memory, making three main contributions. First, we propose the permissions-only cache, a mechanism that efficiently increases the size of transactions that can be handled in the local cache hierarchy to optimize performance. Second, we propose OneTM, an unbounded hardware transactional memory system that serializes transactions that escape the local cache hierarchy. Finally, we propose RetCon, a novel mechanism for detecting conflicts that reduces conflicts by allowing transactions to commit with different values than those with which they executed as long as dataflow and control-flow constraints are maintained
Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree
With the emergence of byte-addressable persistent memory (PM), a cache line, instead of a page, is expected to be the unit of data transfer between volatile and non-volatile devices, but the failure-atomicity of write operations is guaranteed in the granularity of 8 bytes rather than cache lines. This granularity mismatch problem has generated interest in redesigning block-based data structures such as B+-trees. However, various methods of modifying B+-trees for PM degrade the efficiency of B+-trees, and attempts have been made to use in-memory data structures for PM.
In this study, we develop Failure-Atomic ShifT (FAST) and Failure-Atomic In-place Rebalance (FAIR) algorithms to resolve the granularity mismatch problem. Every 8-byte store instruction used in the FAST and FAIR algorithms transforms a B+-tree into another consistent state or a transient inconsistent state that read operations can tolerate. By making read operations tolerate transient inconsistency, we can avoid expensive copy-on-write, logging, and even the necessity of read latches, so that read transactions can be non-blocking. Our experimental results show that legacy B+-trees with FAST and FAIR schemes outperform the state-of-the-art persistent indexing structures by a large margin.
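The core of a FAST-style shift can be sketched as an insert into a sorted node of 8-byte keys. This is a simplification: the real algorithm also orders cache-line flushes between lines and handles sibling pointers, but the key idea survives. Each slot move is a single failure-atomic 8-byte store, so a crash at any point leaves either the old node or a node with a transiently duplicated neighbor key, a state that readers can detect and tolerate without logging.

```c
#include <stdint.h>
#include <stddef.h>

/* Insert `key` into a sorted array of n keys (capacity > n), shifting
   right-to-left. Every iteration performs exactly one 8-byte store, the
   unit of failure atomicity on PM, so each intermediate state is either
   consistent or a tolerable transient duplicate. */
void fast_insert(uint64_t *keys, size_t n, uint64_t key) {
    size_t i = n;
    while (i > 0 && keys[i - 1] > key) {
        keys[i] = keys[i - 1];  /* duplicate neighbor: readers tolerate */
        i--;
    }
    keys[i] = key;              /* final 8-byte store completes insert */
}
```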
- …