3,164 research outputs found
Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs
Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses to larger remote access operations. A straightforward implementation of the inspector executor transformation results in excessive instrumentation that hinders performance.; This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.; A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8 x their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3 x for applications with irregular accesses.Peer ReviewedPostprint (author's final draft
The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization
Research in automatic parallelization of loop-centric programs started with
static analysis, then broadened its arsenal to include dynamic
inspection-execution and speculative execution, the best results involving
hybrid static-dynamic schemes. Beyond the detection of parallelism in a
sequential program, scalable parallelization on many-core processors involves
hard and interesting parallelism adaptation and mapping challenges. These
challenges include tailoring data locality to the memory hierarchy, structuring
independent tasks hierarchically to exploit multiple levels of parallelism,
tuning the synchronization grain, balancing the execution load, decoupling the
execution into thread-level pipelines, and leveraging heterogeneous hardware
with specialized accelerators. The polyhedral framework allows to model,
construct and apply very complex loop nest transformations addressing most of
the parallelism adaptation and mapping challenges. But apart from
hardware-specific, back-end oriented transformations (if-conversion, trace
scheduling, value prediction), loop nest optimization has essentially ignored
dynamic and speculative techniques. Research in polyhedral compilation recently
reached a significant milestone towards the support of dynamic, data-dependent
control flow. This opens a large avenue for blending dynamic analyses and
speculative techniques with advanced loop nest optimizations. Selecting
real-world examples from SPEC benchmarks and numerical kernels, we make a case
for the design of synergistic static, dynamic and speculative loop
transformation techniques. We also sketch the embedding of dynamic information,
including speculative assumptions, in the heart of affine transformation search
spaces
Towards co-designed optimizations in parallel frameworks: A MapReduce case study
The explosion of Big Data was followed by the proliferation of numerous
complex parallel software stacks whose aim is to tackle the challenges of data
deluge. A drawback of a such multi-layered hierarchical deployment is the
inability to maintain and delegate vital semantic information between layers in
the stack. Software abstractions increase the semantic distance between an
application and its generated code. However, parallel software frameworks
contain inherent semantic information that general purpose compilers are not
designed to exploit.
This paper presents a case study demonstrating how the specific semantic
information of the MapReduce paradigm can be exploited on multicore
architectures. MR4J has been implemented in Java and evaluated against
hand-optimized C and C++ equivalents. The initial observed results led to the
design of a semantically aware optimizer that runs automatically without
requiring modification to application code.
The optimizer is able to speedup the execution time of MR4J by up to 2.0x.
The introduced optimization not only improves the performance of the generated
code, during the map phase, but also reduces the pressure on the garbage
collector. This demonstrates how semantic information can be harnessed without
sacrificing sound software engineering practices when using parallel software
frameworks.Comment: 8 page
Speculative Staging for Interpreter Optimization
Interpreters have a bad reputation for having lower performance than
just-in-time compilers. We present a new way of building high performance
interpreters that is particularly effective for executing dynamically typed
programming languages. The key idea is to combine speculative staging of
optimized interpreter instructions with a novel technique of incrementally and
iteratively concerting them at run-time.
This paper introduces the concepts behind deriving optimized instructions
from existing interpreter instructions---incrementally peeling off layers of
complexity. When compiling the interpreter, these optimized derivatives will be
compiled along with the original interpreter instructions. Therefore, our
technique is portable by construction since it leverages the existing
compiler's backend. At run-time we use instruction substitution from the
interpreter's original and expensive instructions to optimized instruction
derivatives to speed up execution.
Our technique unites high performance with the simplicity and portability of
interpreters---we report that our optimization makes the CPython interpreter up
to more than four times faster, where our interpreter closes the gap between
and sometimes even outperforms PyPy's just-in-time compiler.Comment: 16 pages, 4 figures, 3 tables. Uses CPython 3.2.3 and PyPy 1.
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our
Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated
quarterly here (Send me your new published papers to be added in the
subsequent version) History: Received November 2016; Revised August 2017;
Revised February 2018; Accepted March 2018
- …