7 research outputs found
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse by multiple
proprietary vendor implementations with different characteristics and, thus,
different required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
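The idea of forming parallel regions can be pictured with a small model. The sketch below is an illustration only, not pocl's compiler: it assumes a kernel with a single barrier, splits it into two regions, and runs each region in a loop over the work-items of one work-group. All names (`region_before_barrier`, `region_after_barrier`, `run_work_group`) are hypothetical.

```python
# Illustrative model only: the function names are hypothetical and do not
# correspond to pocl's actual passes or API.

def region_before_barrier(lid, data, scratch):
    # Region 1: each work-item writes its own element to local scratch memory.
    scratch[lid] = data[lid] * 2


def region_after_barrier(lid, data, scratch, out):
    # Region 2: each work-item reads a neighbour's element, which is only
    # safe once every work-item has finished region 1.
    n = len(scratch)
    out[lid] = scratch[(lid + 1) % n] + data[lid]


def run_work_group(data):
    """Execute one work-group by looping over work-items, region by region."""
    n = len(data)
    scratch = [0] * n
    out = [0] * n
    # The barrier in the kernel becomes the boundary between two loops.
    # Iterations inside each loop are independent of each other, so a later
    # target-specific step could map them to SIMD lanes or multi-issue slots.
    for lid in range(n):
        region_before_barrier(lid, data, scratch)
    for lid in range(n):
        region_after_barrier(lid, data, scratch, out)
    return out
```

Calling `run_work_group([1, 2, 3, 4])` runs all four work-items through the first region before any of them enters the second, which is the ordering the barrier requires.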
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.
Comment: This article was published in 2015; it is now openly accessible via
arxi
Whole-function vectorization
Data-parallel programming languages are an important component in today’s parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and “general-purpose” languages like CUDA or OpenCL. Current implementations of those languages on CPUs rely solely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction set of those processors (like Intel’s SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing data-parallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: first, incorporated in a compiler for a domain-specific language used in real-time ray tracing; second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels.
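A central ingredient of vectorizing code with control flow is replacing branches with masks and blends. The sketch below illustrates that idea only, under the assumption of a SIMD width of 4 and with Python lists standing in for vector registers. It is not the paper's transformation, and the names `scalar_kernel` and `vectorized_kernel` are hypothetical.

```python
WIDTH = 4  # assumed SIMD width, chosen only for illustration


def scalar_kernel(x):
    # Original per-instance code with divergent control flow.
    if x > 0:
        return x * 2
    return -x


def vectorized_kernel(xs):
    """Process WIDTH instances at once; lists stand in for SIMD registers."""
    assert len(xs) == WIDTH
    mask = [x > 0 for x in xs]        # vector compare produces a lane mask
    then_vals = [x * 2 for x in xs]   # 'then' path evaluated for all lanes
    else_vals = [-x for x in xs]      # 'else' path evaluated for all lanes
    # A blend (select) replaces the branch: each lane keeps the value from
    # the path it would have taken in the scalar code.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]
```

Both paths are evaluated for every lane, so the vector version does extra work whenever lanes diverge. That is one plausible reason a vectorized kernel can end up slower than the scalar one.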
Sequential Optimization of Paths in Directed Graphs Relative to Different Cost Functions
Finding optimal paths in directed graphs is a wide area of research that has received much attention in theoretical computer science due to its importance in many applications (e.g., computer networks and road maps). Many algorithms have been developed to solve the optimal path problem for different kinds of graphs. An algorithm that optimizes paths in directed graphs relative to different cost functions is described in [1]. It extends the dynamic programming approach: it solves the problem sequentially and works on directed graphs with positive weights and no loop edges.
The aim of this thesis is to implement and evaluate that algorithm to find the optimal paths in directed graphs relative to two different cost functions. A possible interpretation of a directed graph is a network of roads, so the weights for the first function represent the length of roads, whereas the weights for the second function represent a constraint on the width or weight of a vehicle. The optimization aim is to minimize the cost relative to the first function and maximize the constraint value associated with the second function. This thesis also includes finding and proving the relation between the two cost functions: given a value of one function, we can find the best possible value for the other. This relation is proven theoretically and also implemented and tested experimentally using Matlab®[2].
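The relation between the two cost functions can be illustrated with a small sketch. It does not implement the algorithm from [1]; it swaps in Dijkstra's algorithm with edge filtering, which is enough to show how fixing a value for the width constraint determines the best achievable length. The graph layout (`{node: [(neighbour, length, width), ...]}`) and both function names are assumptions made for this example.

```python
import heapq


def shortest_length_with_min_width(graph, source, target, min_width):
    """Minimum total length from source to target using only edges whose
    width is at least min_width; returns None if no such path exists.

    graph: {node: [(neighbour, length, width), ...]} with positive lengths.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, length, width in graph.get(u, []):
            if width < min_width:
                continue  # edge too narrow for the required constraint
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


def length_width_tradeoff(graph, source, target):
    """For each edge width occurring in the graph, the best length achievable
    when the path's narrowest edge must be at least that wide."""
    widths = sorted({w for edges in graph.values() for _, _, w in edges})
    result = []
    for w in widths:
        d = shortest_length_with_min_width(graph, source, target, w)
        if d is not None:
            result.append((w, d))
    return result
```

For a road network such as `{"a": [("b", 1, 2), ("c", 4, 5)], "b": [("d", 1, 2)], "c": [("d", 4, 5)], "d": []}`, the short route from `a` to `d` is narrow and the wide route is long, so raising the required width raises the best achievable length.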
Future value based single assignment program representations and optimizations
An optimizing compiler's internal representation fundamentally affects the clarity, efficiency and feasibility of the optimization algorithms employed by the compiler. Static Single Assignment (SSA), as a state-of-the-art program representation, has great advantages but can still be improved. This dissertation explores the domain of single assignment beyond SSA, and presents two novel program representations: Future Gated Single Assignment (FGSA) and Recursive Future Predicated Form (RFPF). Both FGSA and RFPF embed control flow and data flow information, enabling efficient traversal of program information and thus leading to better and simpler optimizations. We introduce the future value concept, the design basis of both FGSA and RFPF, which permits a consumer instruction to be encountered before the producer of its source operand(s) in a control flow setting. We show that FGSA is efficiently computable by using a series of T1/T2/TR transformations, yielding an expected linear time algorithm that combines the construction of the pruned single assignment form with live analysis for both reducible and irreducible graphs. The approach yields an average reduction of 7.7%, with a maximum of 67%, in the number of gating functions compared to the pruned SSA form on the SPEC2000 benchmark suite. We present a solid and near-optimal framework to perform the inverse transformation from single assignment programs. We demonstrate the importance of unrestricted code motion and present RFPF. We develop algorithms that enable instruction movement in acyclic as well as cyclic regions, and show the ease of performing optimizations such as Partial Redundancy Elimination on RFPF.
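For readers unfamiliar with T1/T2 transformations, the sketch below shows their classical use as a reducibility test on a control flow graph: T1 deletes a self-loop, and T2 merges a node that has a unique predecessor into that predecessor. It is not the dissertation's FGSA construction and does not model the TR transformation. The function name `is_reducible` and the graph encoding are assumptions for this example.

```python
def is_reducible(succ, entry):
    """Classical T1/T2 reducibility test.

    succ: {node: set of successor nodes}; every node must appear as a key
    and is assumed to be reachable from entry.
    Returns True if repeated T1/T2 steps collapse the graph to one node.
    """
    g = {n: set(s) for n, s in succ.items()}  # work on a copy
    changed = True
    while changed and len(g) > 1:
        changed = False
        # T1: delete self-loops.
        for n in g:
            if n in g[n]:
                g[n].discard(n)
                changed = True
        # T2: merge one non-entry node that has a unique predecessor.
        for n in list(g):
            if n == entry:
                continue
            preds = [p for p in g if n in g[p]]
            if len(preds) == 1:
                p = preds[0]
                g[p].discard(n)
                g[p] |= g.pop(n)  # p inherits n's outgoing edges
                changed = True
                break
    return len(g) == 1
```

A natural loop such as `{"e": {"a"}, "a": {"b"}, "b": {"a", "x"}, "x": set()}` collapses to a single node, whereas a loop with two entries, `{"e": {"a", "b"}, "a": {"b"}, "b": {"a"}}`, gets stuck because neither `a` nor `b` has a unique predecessor.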
Removing and restoring control flow with the Value State Dependence Graph
This thesis studies the practicality of compiling with only data flow information.
Specifically, we focus on the challenges that arise when using the Value
State Dependence Graph (VSDG) as an intermediate representation (IR).
We perform a detailed survey of IRs in the literature in order to discover
trends over time, and we classify them by their features in a taxonomy. We
see how the VSDG fits into the IR landscape, and look at the divide between
academia and the 'real world' in terms of compiler technology. Since most
data flow IRs cannot be constructed for irreducible programs, we perform an
empirical study of irreducibility in current versions of open source software,
and then compare them with older versions of the same software. We also
study machine-generated C code from a variety of different software tools.
We show that irreducibility is no longer a problem, and is becoming less so
with time. We then address the problem of constructing the VSDG. Since
previous approaches in the literature have been poorly documented or ignored
altogether, we give our approach to constructing the VSDG from a common
IR: the Control Flow Graph. We show how our approach is independent of
the source and target language, how it is able to handle unstructured control
flow, and how it is able to transform irreducible programs on the fly. Once the
VSDG is constructed, we implement Lawrence's proceduralisation algorithm
in order to encode an evaluation strategy whilst translating the program into
a parallel representation: the Program Dependence Graph. From here, we
implement scheduling and then code generation using the LLVM compiler.
We compare our compiler framework against several existing compilers, and
show how removing control flow with the VSDG and then restoring it later
can produce high quality code. We also examine specific situations where the
VSDG can put pressure on existing code generators. Our results show that the
VSDG represents a radically different, yet practical, approach to compilation.
Waddle - Always-canonical Intermediate Representation
Program transformations that can rely on canonical properties of the program undergoing optimization can be written to be more robust and efficient than an equivalent but generalized transformation that also handles non-canonical programs. If a canonical property is required but was broken by an earlier transformation, it must be rebuilt (often from scratch). This additional work can be a dominating factor in compilation time when many transformations are applied over large programs. This dissertation introduces a methodology for constructing program transformations so that the program remains in an always-canonical form as it is mutated, making only local changes to restore broken properties.