224 research outputs found
TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA
Memory consistency models (MCMs) which govern inter-module interactions in a
shared memory system, are a significant, yet often under-appreciated, aspect of
system design. MCMs are defined at the various layers of the hardware-software
stack, requiring thoroughly verified specifications, compilers, and
implementations at the interfaces between layers. Current verification
techniques evaluate segments of the system stack in isolation, such as proving
compiler mappings from a high-level language (HLL) to an ISA or proving
validity of a microarchitectural implementation of an ISA.
This paper makes a case for full-stack MCM verification and provides a
toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and
implementation collectively uphold MCM requirements. The work showcases
TriCheck's ability to evaluate a proposed ISA MCM in order to ensure that each
layer and each mapping is correct and complete. Specifically, we apply TriCheck
to the open source RISC-V ISA, seeking to verify accurate, efficient, and legal
compilations from C11. We uncover under-specifications and potential
inefficiencies in the current RISC-V ISA documentation and identify possible
solutions for each. As an example, we find that a RISC-V-compliant
microarchitecture allows 144 outcomes forbidden by C11 to be observed out of
1,701 litmus tests examined. Overall, this paper demonstrates the necessity of
full-stack verification for detecting MCM-related bugs in the hardware-software
stack.Comment: Proceedings of the Twenty-Second International Conference on
Architectural Support for Programming Languages and Operating System
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphic Processing Units (GPUs). The aim of this development is efficient
simulations on future exascale systems by allowing different parallelization
strategies depending on the application problem and the specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
one on an Intel Sandybridge 8-core CPU by a factor of 3.4
Recommended from our members
Scalable Emulation of Heterogeneous Systems
The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors.
To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes---such as operating system or instruction set architecture (ISA) innovations---that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution.
To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to---among other possible uses---drive the execution of a full-system, cross-ISA simulator with support for accelerators.
We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration
Transactional memory on heterogeneous architectures
Tesis Leida el 9 de Marzo de 2018.Si observamos las necesidades computacionales de hoy, y tratamos de predecir
las necesidades del mañana, podemos concluir que el procesamiento heterogéneo
estará presente en muchos dispositivos y aplicaciones.
El motivo es lĂłgico: algoritmos diferentes y datos de naturaleza diferente encajan mejor
en unos dispositivos de cĂłmputo que en otros. Pongamos como ejemplo una
tecnologĂa de vanguardia como son los vehĂculos inteligentes. En este tipo de
aplicaciones la computación heterogénea no es una opción, sino un requisito.
En este tipo de vehĂculos se recolectan y analizan imágenes, tarea para la cual
los procesadores gráficos (GPUs) son muy eficientes.
Muchos de estos vehĂculos utilizan algoritmos sencillos,
pero con grandes requerimientos de tiempo real, que deben
implementarse directamente en hardware utilizando FPGAs.
Y, por supuesto, los procesadores multinĂşcleo tienen un
papel fundamental en estos sistemas, tanto organizando el trabajo de otros coprocesadores
como ejecutando tareas en las que ningĂşn otro procesador
es más eficiente. No obstante, los procesadores tampoco siguen siendo dispositivos
homogéneos. Los diferentes núcleos de un procesador pueden
ofrecer diferentes caracterĂsticas en tĂ©rminos de potencia y consumo
energético que se adapten a las necesidades de cómputo de la aplicación.
Programar este conjunto de dispositivos es una tarea compleja, especialmente
en su sincronizaciĂłn.
Habitualmente, esta sincronizaciĂłn se basa en operaciones atĂłmicas, ejecuciĂłn y
terminación de kernels, barreras y señales. Con estas primitivas de sincronización
básicas se pueden construir otras estructuras más complejas.
Sin embargo, la programaciĂłn de estos
mecanismos es tediosa y propensa a fallos. La memoria transaccional
(TM por sus siglas en inglés) se ha propuesto como un mecanismo
avanzado a la vez que simple para garantizar la exclusiĂłn mutua
Scheduling Irregular Workloads on GPUs
This doctoral research aims at understanding the nature of the overhead for data irregular GPU workloads, proposing a solution, and examining the consequences of the result. We propose a novel, retry-free GPU workload scheduler for irregular workloads. When used in a Breadth First Search (BFS) algorithm, the proposed simple, monolithic concurrent queue scales to within 10% of ideal scalability on AMD’s Fiji GPU with 14,336 active threads. The dissertation presents an important finding that the retry overhead associated with Compare and Swap (CAS) operations is the principle reason why concurrent queues do not scale well as the number of clients increases in a massively multi-threaded environment
The Case For Heterogeneous HTAP
ABSTRACT Modern database engines balance the demanding requirements of mixed, hybrid transactional and analytical processing (HTAP) workloads by relying on i) global shared memory, ii) system-wide cache coherence, and iii) massive parallelism. Thus, database engines are typically deployed on multi-socket multi-cores, which have been the only platform to support all three aspects. Two recent trends, however, indicate that these hardware assumptions will be invalidated in the near future. First, hardware vendors have started exploring alternate non-cache-coherent shared-memory multi-core designs due to escalating complexity in maintaining coherence across hundreds of cores. Second, as GPGPUs overcome programmability, performance, and interfacing limitations, they are being increasingly adopted by emerging servers to expose heterogeneous parallelism. It is thus necessary to revisit database engine design because current engines can neither deal with the lack of cache coherence nor exploit heterogeneous parallelism. In this paper, we make the case for Heterogeneous-HTAP (H 2 TAP), a new architecture explicitly targeted at emerging hardware. H 2 TAP engines store data in shared memory to maximize data freshness, pair workloads with ideal processor types to exploit heterogeneity, and use message passing with explicit processor cache management to circumvent the lack of cache coherence. Using Caldera, a prototype H 2 TAP engine, we show that the H 2 TAP architecture can be realized in practice and can offer performance competitive with specialized OLTP and OLAP engines
Data Parallel C++
Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices—including GPUs, CPUs, FPGAs and AI ASICs—that are suitable to the problems at hand. This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems. What You'll Learn Accelerate C++ programs using data-parallel programming Target multiple device types (e.g. CPU, GPU, FPGA) Use SYCL and SYCL compilers Connect with computing’s heterogeneous future via Intel’s oneAPI initiative Who This Book Is For Those new data-parallel programming and computer programmers interested in data-parallel programming using C++
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
Enhancing Monte Carlo Particle Transport for Modern Many-Core Architectures
Since near the very beginning of electronic computing, Monte Carlo particle transport has been a fundamental approach for solving computational physics problems. Due to the high computational demands and inherently parallel nature of these applications, Monte Carlo transport applications are often performed in the supercomputing environment. That said, supercomputers are changing, as parallelism has dramatically increased with each supercomputer node, including regular inclusion of many-core devices. Monte Carlo transport, like all applications that run on supercomputers, will be forced to make significant changes to their designs in order to utilize these new architectures effectively. This dissertation presents solutions for central challenges that face Monte Carlo particle transport in this changing environment, specifically in the areas of threading models, tracking algorithms, tally data collection, and heterogenous load balancing. In addition, the dissertation culminates with a study that combines all of the presented techniques in a production application at scale on Lawrence Livermore National Laboratory's RZAnsel Supercomputer
- …