4,611 research outputs found
Recommended from our members
Computer-aided programming for multiprocessing systems
As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This report discusses parallel models of computation and tools for computer-aided programming (CAP). Program development tools are necessary since programmers are not able to develop complex parallel programs efficiently. In particular, a CAP tool, named Hypertool, is described here. It performs scheduling and handles the communication primitive insertion automatically so that many errors are eliminated. It also generates the performance estimates and other program quality measures to help programmers in improving their algorithms and programs. Experiments have shown that up to a 300% performance improvement can be achieved by computer-aided programming
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
FormCalc 8: Better Algebra and Vectorization
We present Version 8 of the Feynman-diagram calculator FormCalc. New features
include in particular significantly improved algebraic simplification as well
as vectorization of the generated code. The Cuba Library, used in FormCalc,
features checkpointing to disk for all integration algorithms.Comment: 7 pages, LaTeX, proceedings contribution to ACAT 2013, Beijing,
China, 16-21 May 201
Polly's Polyhedral Scheduling in the Presence of Reductions
The polyhedral model provides a powerful mathematical abstraction to enable
effective optimization of loop nests with respect to a given optimization goal,
e.g., exploiting parallelism. Unexploited reduction properties are a frequent
reason for polyhedral optimizers to assume parallelism prohibiting dependences.
To our knowledge, no polyhedral loop optimizer available in any production
compiler provides support for reductions. In this paper, we show that
leveraging the parallelism of reductions can lead to a significant performance
increase. We give a precise, dependence based, definition of reductions and
discuss ways to extend polyhedral optimization to exploit the associativity and
commutativity of reduction computations. We have implemented a
reduction-enabled scheduling approach in the Polly polyhedral optimizer and
evaluate it on the standard Polybench 3.2 benchmark suite. We were able to
detect and model all 52 arithmetic reductions and achieve speedups up to
2.21 on a quad core machine by exploiting the multidimensional
reduction in the BiCG benchmark.Comment: Presented at the IMPACT15 worksho
New Algebraic Formulation of Density Functional Calculation
This article addresses a fundamental problem faced by the ab initio
community: the lack of an effective formalism for the rapid exploration and
exchange of new methods. To rectify this, we introduce a novel, basis-set
independent, matrix-based formulation of generalized density functional
theories which reduces the development, implementation, and dissemination of
new ab initio techniques to the derivation and transcription of a few lines of
algebra. This new framework enables us to concisely demystify the inner
workings of fully functional, highly efficient modern ab initio codes and to
give complete instructions for the construction of such for calculations
employing arbitrary basis sets. Within this framework, we also discuss in full
detail a variety of leading-edge ab initio techniques, minimization algorithms,
and highly efficient computational kernels for use with scalar as well as
shared and distributed-memory supercomputer architectures
Teaching Parallel Programming Using Java
This paper presents an overview of the "Applied Parallel Computing" course
taught to final year Software Engineering undergraduate students in Spring 2014
at NUST, Pakistan. The main objective of the course was to introduce practical
parallel programming tools and techniques for shared and distributed memory
concurrent systems. A unique aspect of the course was that Java was used as the
principle programming language. The course was divided into three sections. The
first section covered parallel programming techniques for shared memory systems
that include multicore and Symmetric Multi-Processor (SMP) systems. In this
section, Java threads was taught as a viable programming API for such systems.
The second section was dedicated to parallel programming tools meant for
distributed memory systems including clusters and network of computers. We used
MPJ Express-a Java MPI library-for conducting programming assignments and lab
work for this section. The third and the final section covered advanced topics
including the MapReduce programming model using Hadoop and the General Purpose
Computing on Graphics Processing Units (GPGPU).Comment: 8 Pages, 6 figures, MPJ Express, MPI Java, Teaching Parallel
Programmin
Parallel structurally-symmetric sparse matrix-vector products on multi-core processors
We consider the problem of developing an efficient multi-threaded
implementation of the matrix-vector multiplication algorithm for sparse
matrices with structural symmetry. Matrices are stored using the compressed
sparse row-column format (CSRC), designed for profiting from the symmetric
non-zero pattern observed in global finite element matrices. Unlike classical
compressed storage formats, performing the sparse matrix-vector product using
the CSRC requires thread-safe access to the destination vector. To avoid race
conditions, we have implemented two partitioning strategies. In the first one,
each thread allocates an array for storing its contributions, which are later
combined in an accumulation step. We analyze how to perform this accumulation
in four different ways. The second strategy employs a coloring algorithm for
grouping rows that can be concurrently processed by threads. Our results
indicate that, although incurring an increase in the working set size, the
former approach leads to the best performance improvements for most matrices.Comment: 17 pages, 17 figures, reviewed related work section, fixed typo
- …