16,294 research outputs found
A wide-spectrum language for verification of programs on weak memory models
Modern processors deploy a variety of weak memory models, which for
efficiency reasons may (appear to) execute instructions in an order different
to that specified by the program text. The consequences of instruction
reordering can be complex and subtle, and can impact on ensuring correctness.
Previous work on the semantics of weak memory models has focussed on the
behaviour of assembler-level programs. In this paper we utilise that work to
extract some general principles underlying instruction reordering, and apply
those principles to a wide-spectrum language encompassing abstract data types
as well as low-level assembler code. The goal is to support reasoning about
implementations of data structures for modern processors with respect to an
abstract specification.
Specifically, we define an operational semantics, from which we derive some
properties of program refinement, and encode the semantics in the rewriting
engine Maude as a model-checking tool. The tool is used to validate the
semantics against the behaviour of a set of litmus tests (small assembler
programs) run on hardware, and also to model check implementations of data
structures from the literature against their abstract specifications
Defining correctness conditions for concurrent objects in multicore architectures
Correctness of concurrent objects is defined in terms of conditions that determine allowable relationships between histories of a concurrent object and those of the corresponding sequential object. Numerous correctness conditions have been proposed over the years, and more have been proposed recently as the algorithms implementing concurrent objects have been adapted to cope with multicore processors with relaxed memory architectures. We present a formal framework for defining correctness conditions for multicore architectures, covering both standard conditions for totally ordered memory and newer conditions for relaxed
memory, which allows them to be expressed in uniform manner, simplifying comparison. Our framework distinguishes between order and commitment properties, which in turn enables a hierarchy of correctness conditions to be established. We consider the Total Store Order (TSO) memory model in detail, formalise known conditions for TSO using our framework, and develop sequentially consistent variations of these. We present a work-stealing deque for TSO memory that is not linearizable, but is correct with respect to these new conditions. Using our framework, we identify a new non-blocking compositional condition, fence consistency, which lies between known conditions for TSO, and aims to capture the intention of a programmer-specified fence
High-level synthesis of fine-grained weakly consistent C concurrency
High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.Â
As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments.
Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces
Property-Driven Fence Insertion using Reorder Bounded Model Checking
Modern architectures provide weaker memory consistency guarantees than
sequential consistency. These weaker guarantees allow programs to exhibit
behaviours where the program statements appear to have executed out of program
order. Fortunately, modern architectures provide memory barriers (fences) to
enforce the program order between a pair of statements if needed. Due to the
intricate semantics of weak memory models, the placement of fences is
challenging even for experienced programmers. Too few fences lead to bugs
whereas overuse of fences results in performance degradation. This motivates
automated placement of fences. Tools that restore sequential consistency in the
program may insert more fences than necessary for the program to be correct.
Therefore, we propose a property-driven technique that introduces
"reorder-bounded exploration" to identify the smallest number of program
locations for fence placement. We implemented our technique on top of CBMC;
however, in principle, our technique is generic enough to be used with any
model checker. Our experimental results show that our technique is faster and
solves more instances of relevant benchmarks as compared to earlier approaches.Comment: 18 pages, 3 figures, 4 algorithms. Version change reason : new set of
results and publication ready version of FM 201
Lace: non-blocking split deque for work-stealing
Work-stealing is an efficient method to implement load balancing in fine-grained task parallelism. Typically, concurrent deques are used for this purpose. A disadvantage of many concurrent deques is that they require expensive memory fences for local deque operations.\ud
\ud
In this paper, we propose a new non-blocking work-stealing deque based on the split task queue. Our design uses a dynamic split point between the shared and the private portions of the deque, and only requires memory fences when shrinking the shared portion.\ud
\ud
We present Lace, an implementation of work-stealing based on this deque, with an interface similar to the work-stealing library Wool, and an evaluation of Lace based on several common benchmarks. We also implement a recent approach using private deques in Lace. We show that the split deque and the private deque in Lace have similar low overhead and high scalability as Wool
- …