1,282 research outputs found
TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA
Memory consistency models (MCMs) which govern inter-module interactions in a
shared memory system, are a significant, yet often under-appreciated, aspect of
system design. MCMs are defined at the various layers of the hardware-software
stack, requiring thoroughly verified specifications, compilers, and
implementations at the interfaces between layers. Current verification
techniques evaluate segments of the system stack in isolation, such as proving
compiler mappings from a high-level language (HLL) to an ISA or proving
validity of a microarchitectural implementation of an ISA.
This paper makes a case for full-stack MCM verification and provides a
toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and
implementation collectively uphold MCM requirements. The work showcases
TriCheck's ability to evaluate a proposed ISA MCM in order to ensure that each
layer and each mapping is correct and complete. Specifically, we apply TriCheck
to the open source RISC-V ISA, seeking to verify accurate, efficient, and legal
compilations from C11. We uncover under-specifications and potential
inefficiencies in the current RISC-V ISA documentation and identify possible
solutions for each. As an example, we find that a RISC-V-compliant
microarchitecture allows 144 outcomes forbidden by C11 to be observed out of
1,701 litmus tests examined. Overall, this paper demonstrates the necessity of
full-stack verification for detecting MCM-related bugs in the hardware-software
stack.Comment: Proceedings of the Twenty-Second International Conference on
Architectural Support for Programming Languages and Operating System
Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency
Persistent memory provides high-performance data persistence at main memory.
Memory writes need to be performed in strict order to satisfy storage
consistency requirements and enable correct recovery from system crashes.
Unfortunately, adhering to such a strict order significantly degrades system
performance and persistent memory endurance. This paper introduces a new
mechanism, Loose-Ordering Consistency (LOC), that satisfies the ordering
requirements at significantly lower performance and endurance loss. LOC
consists of two key techniques. First, Eager Commit eliminates the need to
perform a persistent commit record write within a transaction. We do so by
ensuring that we can determine the status of all committed transactions during
recovery by storing necessary metadata information statically with blocks of
data written to memory. Second, Speculative Persistence relaxes the write
ordering between transactions by allowing writes to be speculatively written to
persistent memory. A speculative write is made visible to software only after
its associated transaction commits. To enable this, our mechanism supports the
tracking of committed transaction ID and multi-versioning in the CPU cache. Our
evaluations show that LOC reduces the average performance overhead of memory
persistence from 66.9% to 34.9% and the memory write traffic overhead from
17.1% to 3.4% on a variety of workloads.Comment: This paper has been accepted by IEEE Transactions on Parallel and
Distributed System
Store Atomicity for Transactional Memory
AbstractWe extend the notion of Store Atomicity [Arvind and Jan-Willem Maessen. Memory model = instruction reordering + store atomicity. In ISCA '06: Proceedings of the 33rd annual International Symposium on Computer Architecture, 2006] to a system with atomic transactional memory. This gives a fine-grained graph-based framework for defining and reasoning about transactional memory consistency. The memory model is defined in terms of thread-local Instruction Reordering axioms and Store Atomicity, which describes inter-thread communication via memory. A memory model with Store Atomicity is serializable: there is a unique global interleaving of all operations which respects the reordering rules and serializes all the operations in a transaction together. We extend Store Atomicity to capture this ordering requirement by requiring dependencies which cross a transaction boundary to point in to the initiating instruction or out from the committing instruction. We sketch a weaker definition of transactional serialization which accounts for the ability to interleave transactional operations which touch disjoint memory. We give a procedure for enumerating the behaviors of a transactional program—noting that a safe enumeration procedure permits only one transaction to read from memory at a time. We show that more realistic models of transactional execution require speculative execution. We define the conditions under which speculation must be rolled back, and give criteria to identify which instructions must be rolled back in these cases
CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions
Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out-of-order. To detect order violations, the load queue (LQ) holds all the in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading processor (SMT), stores also search the LQ when writing to cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls upon encountering the LQ full or when its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance.
In this work, we observe that the use of the LQ could be dramatically optimized under the guarantees of the datarace-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a
software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows removing DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations.
With minimal hardware overhead, we show that an 8-core 2- way SMT processor with CELLO avoids almost all conservative searches to the LQ and significantly reduces its occupancy. CELLO allows i) to reduce the LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) to shrink the LQ size from 192 to only 80 entries, reducing the LQ energy expenditure as much as 69% while performing on par with a
mainstream LQ implementation
- …