66,765 research outputs found
HRF-relaxed: Adapting HRF to the complexities of industrial heterogeneous memory models
Memory consistency models, or memory models, allow both programmers and program language implementers to reason about concurrent accesses to one or more memory locations. Memory model specifications balance the often conflicting needs for precise semantics, implementation flexibility, and ease of understanding. Toward that end, popular programming languages like Java, C, and C++ have adopted memory models built on the conceptual foundation of Sequential Consistency for Data-Race-Free programs (SC for DRF). These SC for DRF languages were created with general-purpose homogeneous CPU systems in mind, and all assume a single, global memory address space. Such a uniform address space is usually power and performance prohibitive in heterogeneous Systems on Chips (SoCs), and for that reason most heterogeneous languages have adopted split address spaces and operations with nonglobal visibility.There have recently been two attempts to bridge the disconnect between the CPU-centric assumptions of the SC for DRF framework and the realities of heterogeneous SoC architectures. Hower et al. proposed a class of Heterogeneous-Race-Free (HRF) memory models that provide a foundation for understanding many of the issues in heterogeneous memory models. At the same time, the Khronos Group developed the OpenCL 2.0 memory model that builds on the C++ memory model. The OpenCL 2.0 model includes features not addressed by HRF: primarily support for relaxed atomics and a property referred to as scope inclusion. In this article, we generalize HRF to allow formalization of and reasoning about more complicated models using OpenCL 2.0 as a point of reference. With that generalization, we (1) make the OpenCL 2.0 memory model more accessible by introducing a platform for feature comparisons to other models, (2) consider a number of shortcomings in the current OpenCL 2.0 model, and (3) propose changes that could be adopted by future OpenCL 2.0 revisions or by other, related, models
Efficient coherence and consistency for specialized memory hierarchies
As the benefits from transistor scaling slow down, specialization is becoming increasingly important for a wide range of applications. Although traditional heterogeneous systems work well for streaming, data parallel applications, they are inefficient for emerging applications, like graph analytics workloads, with fine-grained synchronization, relaxed atomics, and more general sharing patterns. Heterogeneous systems are also difficult to program, which makes it harder for programmers to take advantage of the potential benefits of specialization.
This thesis redesigns the memory hierarchy of heterogeneous systems to make heterogeneous systems more efficient and easier to use. In particular, we focus on three key sources of inefficiency in the memory hierarchy of modern heterogeneous systems: (1) a unified global address space, (2) the cache coherence protocol, and (3) the memory consistency model.
A unified global address space makes it easier to write programs for heterogeneous systems. Although industry has recently begun to provide a unified global address space across CPUs and accelerators (primarily GPUs), there are many inefficiencies. For example, emerging applications with fine-grained synchronization need better support for coherence and consistency. We find that simple coherence and complex consistency are key sources of inefficiency. To resolve this problem, we adjust the division of complexity between the cache coherence protocol and memory consistency model: we introduce DeNovo for accelerators (DeNovoA), which extends DeNovo’s hybrid, software-driven hardware coherence protocol to heterogeneous systems. Unlike current coherence protocols for heterogeneous systems, DeNovoA obtains ownership for written data, enables heterogeneous systems to use the simpler sequentially consistent for data-race-free (SC-for-DRF, or DRF) memory consistency model, and provides both efficiency and programmability. Across a wide variety of applications, DeNovoA with a DRF memory consistency model either outperforms or provides comparable efficiency to a the state-of-the-art approach.
Although DRF is easier to use and works well for most applications, there are some corner cases where its overheads are unnecessary and hurt performance. This led to the introduction of relaxed atomics in the memory consistency models for multi-core CPUs and heterogeneous systems. Although relaxed atomics can significantly improve performance, they are very difficult to use correctly. We address the impact of relaxed atomics on memory consistency models for heterogeneous systems by creating a new memory consistency model, Data-Race-Free-Relaxed or DRFrlx. DRFrlx extends the existing DRF memory consistency models to provide SC-centric semantics for all common uses of relaxed atomics in heterogeneous systems while retaining their efficiency benefits. Thus, DRFrlx makes it easier for programmers to safely use relaxed atomics.
Although current heterogeneous systems are adopting unified global address spaces, specialized memories such as scratchpads still exist in disjoint, private address spaces. This increases programming complexity and causes inefficiencies that negate some of the benefits of specialization. We introduce a new memory organization, stash, that mitigates the inefficiencies of specialized memories by integrating them into the coherent, globally visible address space. Stash makes it easier for programmers to use specialized memories and retains their efficiency benefits.
Finally, to better understand the tradeoffs and scalability of different coherence protocols and consistency models, we created a suite of synchronization microbenchmarks, HeteroSync. HeteroSync contains various fine-grained synchronization and relaxed atomics algorithms. Moreover, HeteroSync is highly configurable and provides a standard set of fine-grained synchronization microbenchmarks to compare the efficiency of different approaches.
In summary, this thesis questions the state-of-the-art approaches for designing memory hierarchies of heterogeneous systems, and shows that the current techniques provide neither efficiency nor programmability for emerging workloads. We demonstrate how DeNovoA with a DRFrlx memory consistency model improves efficiency and programmability for many heterogeneous applications and makes it easier for programmers to use heterogeneous systems
Programming MPSoC platforms: Road works ahead
This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture. Even more important from a SW developer´s viewpoint, at the same time the classical sequential von Neumann programming model needs to be overcome. Efficient utilization of the MPSoC HW resources demands for radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for successful\ud
deployment of heterogeneous embedded MPSoC. On the other hand, at least for coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code that demands for a more pragmatic, gradual tool and code replacement strategy
Remote-scope Promotion: Clarified, Rectified, and Verified
Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue, that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon
Engineering a static verification tool for GPU kernels
We report on practical experiences over the last 2.5 years related to the engineering of GPUVerify, a static verification tool for OpenCL and CUDA GPU kernels, plotting the progress of GPUVerify from a prototype to a fully functional and relatively efficient analysis tool. Our hope is that this experience report will serve the verification community by helping to inform future tooling efforts. © 2014 Springer International Publishing
Breadth First Search Vectorization on the Intel Xeon Phi
Breadth First Search (BFS) is a building block for graph algorithms and has
recently been used for large scale analysis of information in a variety of
applications including social networks, graph databases and web searching. Due
to its importance, a number of different parallel programming models and
architectures have been exploited to optimize the BFS. However, due to the
irregular memory access patterns and the unstructured nature of the large
graphs, its efficient parallelization is a challenge. The Xeon Phi is a
massively parallel architecture available as an off-the-shelf accelerator,
which includes a powerful 512 bit vector unit with optimized scatter and gather
functions. Given its potential benefits, work related to graph traversing on
this architecture is an active area of research.
We present a set of experiments in which we explore architectural features of
the Xeon Phi and how best to exploit them in a top-down BFS algorithm but the
techniques can be applied to the current state-of-the-art hybrid, top-down plus
bottom-up, algorithms.
We focus on the exploitation of the vector unit by developing an improved
highly vectorized OpenMP parallel algorithm, using vector intrinsics, and
understanding the use of data alignment and prefetching. In addition, we
investigate the impact of hyperthreading and thread affinity on performance, a
topic that appears under researched in the literature. As a result, we achieve
what we believe is the fastest published top-down BFS algorithm on the version
of Xeon Phi used in our experiments. The vectorized BFS top-down source code
presented in this paper can be available on request as free-to-use software
Deterministic Consistency: A Programming Model for Shared Memory Parallelism
The difficulty of developing reliable parallel software is generating
interest in deterministic environments, where a given program and input can
yield only one possible result. Languages or type systems can enforce
determinism in new code, and runtime systems can impose synthetic schedules on
legacy parallel code. To parallelize existing serial code, however, we would
like a programming model that is naturally deterministic without language
restrictions or artificial scheduling. We propose "deterministic consistency",
a parallel programming model as easy to understand as the "parallel assignment"
construct in sequential languages such as Perl and JavaScript, where concurrent
threads always read their inputs before writing shared outputs. DC supports
common data- and task-parallel synchronization abstractions such as fork/join
and barriers, as well as non-hierarchical structures such as producer/consumer
pipelines and futures. A preliminary prototype suggests that software-only
implementations of DC can run applications written for popular parallel
environments such as OpenMP with low (<10%) overhead for some applications.Comment: 7 pages, 3 figure
The Naming Game in Social Networks: Community Formation and Consensus Engineering
We study the dynamics of the Naming Game [Baronchelli et al., (2006) J. Stat.
Mech.: Theory Exp. P06014] in empirical social networks. This stylized
agent-based model captures essential features of agreement dynamics in a
network of autonomous agents, corresponding to the development of shared
classification schemes in a network of artificial agents or opinion spreading
and social dynamics in social networks. Our study focuses on the impact that
communities in the underlying social graphs have on the outcome of the
agreement process. We find that networks with strong community structure hinder
the system from reaching global agreement; the evolution of the Naming Game in
these networks maintains clusters of coexisting opinions indefinitely. Further,
we investigate agent-based network strategies to facilitate convergence to
global consensus.Comment: The original publication is available at
http://www.springerlink.com/content/70370l311m1u0ng3
- …