Interoperability of Data Parallel Runtime Libraries with Meta-Chaos
This paper describes a framework for providing the ability to
use multiple specialized data parallel libraries and/or languages
within a single application. The ability to use multiple libraries is
required in many application areas, such as multidisciplinary complex
physical simulations and remote sensing image database applications.
An application can consist of one program or multiple programs that
use different libraries to parallelize operations on distributed data
structures. The framework is embodied in a runtime library called
Meta-Chaos that has been used to exchange data between data parallel
programs written using High Performance Fortran, the Chaos and
Multiblock Parti libraries developed at Maryland for handling various
types of unstructured problems, and the runtime library for pC++, a
data parallel version of C++ from Indiana University. Experimental
results show that Meta-Chaos is able to move data between libraries
efficiently, and that Meta-Chaos provides effective support for
complex applications.
(Also cross-referenced as UMIACS-TR-96-30)
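As a rough illustration of what moving data between differently distributed arrays involves, the sketch below computes which global indices each source rank would send to each destination rank. It is not the Meta-Chaos API: the block and cyclic distributions, the function names, and the dictionary-based schedule are all assumptions chosen for the example.

```python
def block_owner(i, n, p):
    """Rank owning global index i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division: elements per block
    return i // block


def cyclic_owner(i, p):
    """Rank owning global index i when elements are dealt round-robin to p ranks."""
    return i % p


def build_schedule(n, src_procs, dst_procs):
    """Hypothetical transfer schedule: (src_rank, dst_rank) -> global indices to move.

    The source side is assumed block-distributed and the destination side
    cyclic-distributed; both choices are illustrative assumptions.
    """
    schedule = {}
    for i in range(n):
        key = (block_owner(i, n, src_procs), cyclic_owner(i, dst_procs))
        schedule.setdefault(key, []).append(i)
    return schedule


if __name__ == "__main__":
    for (src, dst), idx in sorted(build_schedule(8, 2, 2).items()):
        print(f"rank {src} -> rank {dst}: {idx}")
```

Each entry lists the elements one pair of ranks would exchange; a real runtime would derive this mapping from the libraries' own distribution descriptors rather than from closed-form owner functions.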
Inviwo -- A Visualization System with Usage Abstraction Levels
The complexity of today's visualization applications demands specific
visualization systems tailored for the development of these applications.
Frequently, such systems utilize levels of abstraction to improve the
application development process, for instance by providing a data flow network
editor. Unfortunately, these abstractions result in several issues, which need
to be circumvented through an abstraction-centered system design. Often, a high
level of abstraction hides low-level details, which makes it difficult to
directly access the underlying computing platform, even though such access is
important for achieving optimal performance. Therefore, we propose a layer structure
developed for modern and sustainable visualization systems allowing developers
to interact with all contained abstraction levels. We refer to these interaction
capabilities as usage abstraction levels, since we target application
developers with various levels of experience. We formulate the requirements for
such a system, derive the desired architecture, and present how the concepts
have been realized, by way of example, within the Inviwo visualization system.
Furthermore, we address several specific challenges that arise during the
realization of such a layered architecture, such as communication between
different computing platforms, performance-centered encapsulation, as well as
layer-independent development by supporting cross-layer documentation and
debugging capabilities.
On the conditions for efficient interoperability with threads: An experience with PGAS languages using Cray communication domains
Today's high performance systems are typically built from shared memory nodes connected by a high speed network. That architecture, combined with the trend towards less memory per core, encourages programmers to use a mixture of message passing and multithreaded programming. Unfortunately, the advantages of using threads for in-node programming are hindered by their inability to efficiently communicate between nodes. In this work, we identify some of the performance problems that arise in such hybrid programming environments and characterize conditions needed to achieve high communication performance for multiple threads: addressability of targets, separability of communication paths, and full direct reachability to targets. Using the GASNet communication layer on the Cray XC30 as our experimental platform, we show how to satisfy these conditions. We also discuss how satisfying these conditions is influenced by the communication abstraction, implementation constraints, and the interconnect messaging capabilities. To evaluate these ideas, we compare the communication performance of a thread-based node runtime to a process-based runtime. Without our GASNet extensions, communication between threads is significantly slower than communication between processes, by up to 21x. Once the implementation is modified to address each of our conditions, the two runtimes have comparable communication performance. This allows programmers to more easily mix models like OpenMP, CILK, or pthreads with a GASNet-based model like UPC, with the associated performance, convenience and interoperability advantages that come from using threads within a node. © 2014 ACM
Addressing context-awareness and standards interoperability in e-learning: a service-oriented framework based on IRS III
Current technologies aimed at supporting learning goals primarily follow a data- and metadata-centric paradigm. They provide the learner with appropriate learning content packages containing the learning process description as well as the learning resources. Whereas process metadata is usually based on a certain standard specification, such as ADL SCORM or IMS Learning Design, the learning resources used (data or services) are specific to pre-defined learning contexts and are allocated manually at design time. Therefore, a content package cannot consider the actual learning context, since this is only known at the runtime of a learning process. These facts limit the reusability of a content package across different standards and contexts. To overcome these issues, this paper proposes an innovative Semantic Web Service-based approach that changes this data- and metadata-based paradigm to a context-adaptive service-oriented approach. In this approach, the learning process is semantically described as a standard-independent process model decomposed into several learning goals. These goals are accomplished at runtime, based on the automatic allocation of the most appropriate service. As a result, we address the dynamic adaptation to specific contexts and, by providing the appropriate mappings to established metadata standards, enable the reuse of the defined semantic learning process model across different standards. To illustrate the application of our approach and to prove its feasibility, a prototypical application based on an initial use case scenario is proposed.
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8x–4.3x speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves 2.9x and 48.2x speedups over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.
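The kernel-tiling idea mentioned in the abstract can be pictured with a minimal CPU-only sketch: a sparse-matrix times dense-matrix product (SpMM) is computed one block of rows at a time, so that only one tile of the sparse matrix and its output needs to be resident at once. This is not the authors' OpenMP/OpenACC implementation; the CSR layout, the function names, and the `tile_rows` parameter are assumptions made for illustration.

```python
def csr_spmm_rows(indptr, indices, data, dense, row_start, row_end):
    """Multiply rows [row_start, row_end) of a CSR matrix by a dense matrix."""
    n_cols = len(dense[0])
    out = []
    for i in range(row_start, row_end):
        acc = [0.0] * n_cols
        # Nonzeros of row i live in data[indptr[i]:indptr[i + 1]].
        for k in range(indptr[i], indptr[i + 1]):
            a = data[k]
            dense_row = dense[indices[k]]
            for j in range(n_cols):
                acc[j] += a * dense_row[j]
        out.append(acc)
    return out


def tiled_spmm(indptr, indices, data, dense, tile_rows):
    """Compute the full SpMM by processing tile_rows rows per step.

    On a GPU, each step would correspond to copying one row tile to the
    device, running the kernel, and copying the partial result back
    (assumed workflow, not taken from the paper).
    """
    n_rows = len(indptr) - 1
    result = []
    for start in range(0, n_rows, tile_rows):
        end = min(start + tile_rows, n_rows)
        result.extend(csr_spmm_rows(indptr, indices, data, dense, start, end))
    return result


if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
    indptr, indices, data = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0]
    X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(tiled_spmm(indptr, indices, data, X, tile_rows=2))
```

The tile size trades device memory footprint against the number of transfers; the result is independent of `tile_rows` because each output row depends only on its own sparse row.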