29,114 research outputs found
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and EngineeringAs the performance and energy efficiency requirement of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings.
Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of a application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocate pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similar to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.clos
Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation
Despite the importance of sparse matrices in numerous fields of science,
software implementations remain difficult to use for non-expert users,
generally requiring the understanding of underlying details of the chosen
sparse matrix storage format. In addition, to achieve good performance, several
formats may need to be used in one program, requiring explicit selection and
conversion between the formats. This can be both tedious and error-prone,
especially for non-expert users. Motivated by these issues, we present a
user-friendly and open-source sparse matrix class for the C++ language, with a
high-level application programming interface deliberately similar to the widely
used MATLAB language. This facilitates prototyping directly in C++ and aids the
conversion of research code into production environments. The class internally
uses two main approaches to achieve efficient execution: (i) a hybrid storage
framework, which automatically and seamlessly switches between three underlying
storage formats (compressed sparse column, Red-Black tree, coordinate list)
depending on which format is best suited and/or available for specific
operations, and (ii) a template-based meta-programming framework to
automatically detect and optimise execution of common expression patterns.
Empirical evaluations on large sparse matrices with various densities of
non-zero elements demonstrate the advantages of the hybrid storage framework
and the expression optimisation mechanism.Comment: extended and revised version of an earlier conference paper
arXiv:1805.0338
Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation
Despite the importance of sparse matrices in numerous fields of science,
software implementations remain difficult to use for non-expert users,
generally requiring the understanding of underlying details of the chosen
sparse matrix storage format. In addition, to achieve good performance, several
formats may need to be used in one program, requiring explicit selection and
conversion between the formats. This can be both tedious and error-prone,
especially for non-expert users. Motivated by these issues, we present a
user-friendly and open-source sparse matrix class for the C++ language, with a
high-level application programming interface deliberately similar to the widely
used MATLAB language. This facilitates prototyping directly in C++ and aids the
conversion of research code into production environments. The class internally
uses two main approaches to achieve efficient execution: (i) a hybrid storage
framework, which automatically and seamlessly switches between three underlying
storage formats (compressed sparse column, Red-Black tree, coordinate list)
depending on which format is best suited and/or available for specific
operations, and (ii) a template-based meta-programming framework to
automatically detect and optimise execution of common expression patterns.
Empirical evaluations on large sparse matrices with various densities of
non-zero elements demonstrate the advantages of the hybrid storage framework
and the expression optimisation mechanism.Comment: extended and revised version of an earlier conference paper
arXiv:1805.0338
SICStus MT - A Multithreaded Execution Environment for SICStus Prolog
The development of intelligent software agents and other
complex applications which continuously interact with their
environments has been one of the reasons why explicit concurrency has
become a necessity in a modern Prolog system today. Such applications
need to perform several tasks which may be very different with respect
to how they are implemented in Prolog. Performing these tasks
simultaneously is very tedious without language support.
This paper describes the design, implementation and evaluation of a
prototype multithreaded execution environment for SICStus Prolog. The
threads are dynamically managed using a small and compact set of
Prolog primitives implemented in a portable way, requiring almost no
support from the underlying operating system
A Study of Energy and Locality Effects using Space-filling Curves
The cost of energy is becoming an increasingly important driver for the
operating cost of HPC systems, adding yet another facet to the challenge of
producing efficient code. In this paper, we investigate the energy implications
of trading computation for locality using Hilbert and Morton space-filling
curves with dense matrix-matrix multiplication. The advantage of these curves
is that they exhibit an inherent tiling effect without requiring specific
architecture tuning. By accessing the matrices in the order determined by the
space-filling curves, we can trade computation for locality. The index
computation overhead of the Morton curve is found to be balanced against its
locality and energy efficiency, while the overhead of the Hilbert curve
outweighs its improvements on our test system.Comment: Proceedings of the 2014 IEEE International Parallel & Distributed
Processing Symposium Workshops (IPDPSW
- …