1,099 research outputs found
goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.Comment: Published at OOPSLA 201
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via
arxi
Recommended from our members
Automatic data/program partitioning using the single assignment principle
Loosely-coupled MIMD architectures do not suffer from memory contention; hence large numbers of processors may be utilized. The main problem, however, is how to partition data and programs in order to exploit the available parallelism. In this paper we show that efficient schemes for automatic data/program partitioning and synchronization may be employed if single assignment is used. Using simulations of program loops common to scientific computations (the Livermore Loops), we demonstrate that only a small fraction of data accesses are remote and thus the degradation in network performance due to multiprocessing is minimal
Memory and Parallelism Analysis Using a Platform-Independent Approach
Emerging computing architectures such as near-memory computing (NMC) promise
improved performance for applications by reducing the data movement between CPU
and memory. However, detecting such applications is not a trivial task. In this
ongoing work, we extend the state-of-the-art platform-independent software
analysis tool with NMC related metrics such as memory entropy, spatial
locality, data-level, and basic-block-level parallelism. These metrics help to
identify the applications more suitable for NMC architectures.Comment: 22nd ACM International Workshop on Software and Compilers for
Embedded Systems (SCOPES '19), May 201
The Janus triad: Exploiting parallelism through dynamic binary modification
We present a unified approach for exploiting thread-level, data-level, and memory-level parallelism through a same-ISA dynamic binary modifier guided by static binary analysis. A static binary analyser first examines an executable and determines the operations required to extract parallelism at runtime, encoding them as a series of rewrite rules that a dynamic binary modifier uses to perform binary transformation. We demonstrate this framework by exploiting three different kinds of parallelism to perform automatic vectorisation, software prefetching, and automatic parallelisation together on legacy application binaries. Software prefetch insertion alone achieves an average speedup of 1.2Ă—, comparing favourably with an automatic compiler pass. Automatic vectorisation brings speedups of 2.7Ă— on the TSVC benchmarks, significantly beating a compiler approach for some workloads. Finally, combining prefetching, vectorisation, and parallelisation realises a speedup of 3.8Ă— on a representative application loop
CAREER: Automated software understanding for retargeting embedded image processing software for data parallel execution
Issued as final reportNational Science Foundation (U.S.
Optimizing SIMD execution in HW/SW co-designed processors
SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators.
This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization.
Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector length is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization that iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing that selectively writes to different parts of a vector register and avoids permutations.
Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume significant amount of real estate on the chip, they become the principle source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce leakage energy of functional units. However, power gating has its own energy and performance overhead associated with it. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for maximum duration. Therefore, resulting in overall leakage energy reduction.Postprint (published version
Evaluating kernels on Xeon Phi to accelerate Gysela application
This work describes the challenges presented by porting parts ofthe Gysela
code to the Intel Xeon Phi coprocessor, as well as techniques used for
optimization, vectorization and tuning that can be applied to other
applications. We evaluate the performance of somegeneric micro-benchmark on Phi
versus Intel Sandy Bridge. Several interpolation kernels useful for the Gysela
application are analyzed and the performance are shown. Some memory-bound and
compute-bound kernels are accelerated by a factor 2 on the Phi device compared
to Sandy architecture. Nevertheless, it is hard, if not impossible, to reach a
large fraction of the peek performance on the Phi device,especially for
real-life applications as Gysela. A collateral benefit of this optimization and
tuning work is that the execution time of Gysela (using 4D advections) has
decreased on a standard architecture such as Intel Sandy Bridge.Comment: submitted to ESAIM proceedings for CEMRACS 2014 summer school version
reviewe
- …