11,798 research outputs found
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via
arxi
Less is More: Exploiting the Standard Compiler Optimization Levels for Better Performance and Energy Consumption
This paper presents the interesting observation that by performing fewer of
the optimizations available in a standard compiler optimization level such as
-O2, while preserving their original ordering, significant savings can be
achieved in both execution time and energy consumption. This observation has
been validated on two embedded processors, namely the ARM Cortex-M0 and the ARM
Cortex-M3, using two different versions of the LLVM compilation framework; v3.8
and v5.0. Experimental evaluation with 71 embedded benchmarks demonstrated
performance gains for at least half of the benchmarks for both processors. An
average execution time reduction of 2.4% and 5.3% was achieved across all the
benchmarks for the Cortex-M0 and Cortex-M3 processors, respectively, with
execution time improvements ranging from 1% up to 90% over the -O2. The savings
that can be achieved are in the same range as what can be achieved by the
state-of-the-art compilation approaches that use iterative compilation or
machine learning to select flags or to determine phase orderings that result in
more efficient code. In contrast to these time consuming and expensive to apply
techniques, our approach only needs to test a limited number of optimization
configurations, less than 64, to obtain similar or even better savings.
Furthermore, our approach can support multi-criteria optimization as it targets
execution time, energy consumption and code size at the same time.Comment: 15 pages, 3 figures, 71 benchmarks used for evaluatio
Lost in translation: Exposing hidden compiler optimization opportunities
Existing iterative compilation and machine-learning-based optimization
techniques have been proven very successful in achieving better optimizations
than the standard optimization levels of a compiler. However, they were not
engineered to support the tuning of a compiler's optimizer as part of the
compiler's daily development cycle. In this paper, we first establish the
required properties which a technique must exhibit to enable such tuning. We
then introduce an enhancement to the classic nightly routine testing of
compilers which exhibits all the required properties, and thus, is capable of
driving the improvement and tuning of the compiler's common optimizer. This is
achieved by leveraging resource usage and compilation information collected
while systematically exploiting prefixes of the transformations applied at
standard optimization levels. Experimental evaluation using the LLVM v6.0.1
compiler demonstrated that the new approach was able to reveal hidden
cross-architecture and architecture-dependent potential optimizations on two
popular processors: the Intel i5-6300U and the Arm Cortex-A53-based Broadcom
BCM2837 used in the Raspberry Pi 3B+. As a case study, we demonstrate how the
insights from our approach enabled us to identify and remove a significant
shortcoming of the CFG simplification pass of the LLVM v6.0.1 compiler.Comment: 31 pages, 7 figures, 2 table. arXiv admin note: text overlap with
arXiv:1802.0984
Implementing Multi-Periodic Critical Systems: from Design to Code Generation
This article presents a complete scheme for the development of Critical
Embedded Systems with Multiple Real-Time Constraints. The system is programmed
with a language that extends the synchronous approach with high-level real-time
primitives. It enables to assemble in a modular and hierarchical manner several
locally mono-periodic synchronous systems into a globally multi-periodic
synchronous system. It also allows to specify flow latency constraints. A
program is translated into a set of real-time tasks. The generated code (\C\
code) can be executed on a simple real-time platform with a dynamic-priority
scheduler (EDF). The compilation process (each algorithm of the process, not
the compiler itself) is formally proved correct, meaning that the generated
code respects the real-time semantics of the original program (respect of
periods, deadlines, release dates and precedences) as well as its functional
semantics (respect of variable consumption).Comment: 15 pages, published in Workshop on Formal Methods for Aerospace
(FMA'09), part of Formal Methods Week 2009
Deductive Optimization of Relational Data Storage
Optimizing the physical data storage and retrieval of data are two key
database management problems. In this paper, we propose a language that can
express a wide range of physical database layouts, going well beyond the row-
and column-based methods that are widely used in database management systems.
We use deductive synthesis to turn a high-level relational representation of a
database query into a highly optimized low-level implementation which operates
on a specialized layout of the dataset. We build a compiler for this language
and conduct experiments using a popular database benchmark, which shows that
the performance of these specialized queries is competitive with a
state-of-the-art in memory compiled database system
Development of an MSC language and compiler, volume 1
Higher order programming language and compiler for advanced computer software system to be used with manned space flights between 1972 and 198
- …