R friendly multi-threading in C++
Calling multi-threaded C++ code from R has its perils. Since the R
interpreter is single-threaded, one must not check for user interruptions or
print to the R console from multiple threads. One can, however, synchronize
with R from the main thread. The R package RcppThread (current version 0.5.3)
contains a header-only C++ library for thread-safe communication with R that
exploits this fact. It includes C++ classes for threads, a thread pool, and
parallel loops that routinely synchronize with R. This article explains the
package's functionality and gives examples of its usage. The synchronization
mechanism may also apply to other threading frameworks. Benchmarks suggest
that, although synchronization causes overhead, the parallel abstractions of
RcppThread are competitive with other popular libraries in typical scenarios
encountered in statistical computing.
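As a rough illustration of the pattern the abstract describes, the sketch below uses RcppThread's parallelFor together with its interrupt check and thread-safe console output; the exported function name and the placeholder workload (squaring a vector) are invented for the example and are not taken from the package documentation.

```cpp
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>
#include <Rcpp.h>
#include <vector>

// Minimal sketch: worker threads queue interrupt checks and console
// output, which the main thread releases to R at safe synchronization
// points. Workers only touch plain C++ storage, never the R API.
// [[Rcpp::export]]
Rcpp::NumericVector squareParallel(Rcpp::NumericVector x)
{
    int n = x.size();
    std::vector<double> out(n);               // plain C++ buffer for the workers
    RcppThread::parallelFor(0, n, [&] (int i) {
        RcppThread::checkUserInterrupt();     // safe: released to R from the main thread
        out[i] = x[i] * x[i];                 // placeholder workload
    });
    RcppThread::Rcout << "processed " << n << " elements\n";  // thread-safe printing
    return Rcpp::wrap(out);
}
```

From R, a function like this could be compiled with Rcpp::sourceCpp() and called like any other R function.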
Partitioning loops with variable dependence distances
A new technique to parallelize loops with variable distance vectors is presented. The method extends previous methods in two ways. First, the present method makes it possible for array subscripts to be any linear combination of all loop indices. The solutions to the linear dependence equations established from such array subscripts are characterized by a pseudo distance matrix (PDM). Second, it allows us to exploit loop parallelism from the PDM by applying unimodular and partitioning transformations that preserve the lexicographical order of the dependent iterations. The algorithms to derive the PDM, to find a suitable loop transformation and to generate parallel code are described, showing that it is possible to parallelize a wider range of loops automatically.
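To make the setting concrete, the hypothetical loop nest below shows the kind of code the abstract targets: both subscripts are linear combinations of the two loop indices, so the distance between dependent iterations varies across the iteration space rather than being a fixed vector. The loop, array name, and bounds are invented for illustration and do not come from the paper.

```cpp
#include <vector>

// Hypothetical loop with variable dependence distances: the written
// element a[2*i + j] and the read element a[i + 2*j] use subscripts
// that are linear combinations of both indices, so the dependence
// distance changes with (i, j). The caller must size `a` to at least
// 3*n - 2 elements.
void variableDistanceLoop(std::vector<double>& a, int n)
{
    for (int i = 1; i < n; ++i)
        for (int j = 1; j < n; ++j)
            a[2 * i + j] = a[i + 2 * j] + 1.0;
}
```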
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
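One of the transformation classes mentioned, combining pipelining with on-chip data reuse, can be hinted at with a small HLS-style kernel. The pragma syntax follows the common Vivado/Vitis HLS convention; the kernel, its name, and its sizes are invented for illustration and are not drawn from the paper's benchmark set.

```cpp
// Hypothetical HLS sketch: a 1D 3-tap stencil that keeps a small shift
// register in on-chip registers (data reuse) and pipelines the loop so
// that one result is produced per clock cycle (II=1).
const int N = 1024;

void stencil3(const float in[N], float out[N])
{
    float window[3] = {0.0f, 0.0f, 0.0f};        // on-chip shift register
#pragma HLS ARRAY_PARTITION variable=window complete

    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        // Shift in the next element instead of re-reading neighbours
        // from external memory on every iteration.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = in[i];
        out[i] = (i >= 2) ? (window[0] + window[1] + window[2]) / 3.0f
                          : 0.0f;
    }
}
```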
Array languages and the N-body problem
This paper describes the contributions to the SICSA multicore challenge on many-body
planetary simulation made by a compiler group at the University of Glasgow. Our group is part of
the Computer Vision and Graphics research group and we have for some years been developing array
compilers because we think these are a good tool both for expressing graphics algorithms and for
exploiting the parallelism that computer vision applications require.
We shall describe experiments using two languages on two different platforms and we shall compare
the performance of these with reference C implementations running on the same platforms. Finally,
we shall draw conclusions both about the viability of the array language approach as compared to
other approaches used in the challenge and about the strengths and weaknesses of the two very
different processor architectures we used.
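For readers unfamiliar with the benchmark, the core computation of such a planetary simulation is the all-pairs gravitational acceleration evaluated at each time step. The generic C++ kernel below is a rough sketch of what an array language would instead express as whole-array operations; names and structure are illustrative only and are not code from the paper.

```cpp
#include <vector>
#include <cmath>

struct Body { double x, y, z, mass; };

// Generic all-pairs N-body kernel: for every body, accumulate the
// gravitational acceleration induced by every other body.
void accelerations(const std::vector<Body>& bodies,
                   std::vector<double>& ax,
                   std::vector<double>& ay,
                   std::vector<double>& az,
                   double G = 6.674e-11)
{
    const std::size_t n = bodies.size();
    for (std::size_t i = 0; i < n; ++i) {
        double axi = 0.0, ayi = 0.0, azi = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double invr3 = 1.0 / (std::sqrt(r2) * r2);  // 1 / |r|^3
            axi += G * bodies[j].mass * dx * invr3;
            ayi += G * bodies[j].mass * dy * invr3;
            azi += G * bodies[j].mass * dz * invr3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}
```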