Can We Run in Parallel? Automating Loop Parallelization for TornadoVM
With the advent of multi-core systems, GPUs, and FPGAs, loop parallelization
has become a promising way to speed up program execution. To keep pace,
various performance-oriented programming languages provide a multitude of
constructs that allow programmers to write parallelizable loops.
Correspondingly, researchers have developed techniques to automatically
parallelize loops that do not carry dependences across iterations and/or call
only pure functions. However, in managed languages with platform-independent
runtimes such as Java, it is practically infeasible to perform complex
dependence analysis during JIT compilation. In this paper, we propose
AutoTornado, a first-of-its-kind static+JIT loop parallelizer for Java programs
that parallelizes loops for heterogeneous architectures using TornadoVM (a
Graal-based VM that supports insertion of @Parallel constructs for loop
parallelization).
AutoTornado performs sophisticated dependence and purity analysis of Java
programs statically, in the Soot framework, to generate constraints encoding
conditions under which a given loop can be parallelized. The generated
constraints are then fed to the Z3 theorem prover (which we have integrated
with Soot) to annotate canonical for loops that can be parallelized using the
@Parallel construct. We have also added runtime support in TornadoVM to use
static analysis results for loop parallelization. Our evaluation over several
standard parallelization kernels shows that AutoTornado correctly parallelizes
61.3% of manually parallelizable loops, with an efficient static analysis and a
near-zero runtime overhead. To the best of our knowledge, AutoTornado is not
only the first tool that performs program-analysis based parallelization for a
real-world JVM, but also the first to integrate Z3 with Soot for loop
parallelization.
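The distinction AutoTornado's dependence analysis draws can be illustrated with a minimal plain-Java sketch. The class and method names here are ours, and TornadoVM's @Parallel annotation is shown only in a comment; this is not AutoTornado's own code:

```java
// Sketch of the two kinds of loops AutoTornado's dependence analysis
// distinguishes. In TornadoVM, the parallelizable loop would be annotated
// with @Parallel (uk.ac.manchester.tornado.api.annotations.Parallel).
public class DependenceDemo {
    // Independent iterations: a[i] is computed from b[i] only, so the loop
    // carries no dependence across iterations and is a candidate for
    // annotation, e.g. for (@Parallel int i = 0; ...).
    static int[] scale(int[] b, int k) {
        int[] a = new int[b.length];
        for (int i = 0; i < b.length; i++) {
            a[i] = k * b[i];
        }
        return a;
    }

    // Loop-carried dependence: iteration i reads a[i - 1], written by
    // iteration i - 1, so this loop cannot be marked @Parallel.
    static int[] prefixSum(int[] b) {
        int[] a = new int[b.length];
        a[0] = b[0];
        for (int i = 1; i < b.length; i++) {
            a[i] = a[i - 1] + b[i];
        }
        return a;
    }
}
```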
Logical Inference Techniques for Loop Parallelization
This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop’s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = ∅). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from the PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full-program speedups than the Intel and IBM Fortran compilers.
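A hypothetical, much-simplified sketch of the run-time side of such a hybrid test: for a loop that writes a[idx[i]] through an indirect index array, independence reduces to the overlap set S being empty, which for write-write conflicts means all write targets are distinct. The loop model and names are ours, not the paper's USR machinery:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative run-time independence check for a loop of the form
//   for i: a[idx[i]] = b[i]
// where idx is only known at run time. The iterations are independent
// exactly when no two of them write the same element of a, i.e. when the
// conflict set S of repeated write targets is empty (S = ∅).
public class IndependenceTest {
    static boolean independent(int[] idx) {
        Set<Integer> seen = new HashSet<>();
        for (int target : idx) {
            if (!seen.add(target)) {
                return false; // duplicate write target: S != ∅, keep serial
            }
        }
        return true; // S = ∅: safe to run the loop in parallel
    }
}
```

A compiler that cannot prove S = ∅ statically can emit this O(n) check as a cheap dynamic guard before dispatching the parallel version of the loop.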
Parallelizing non-vectorizable loops for MIMD machines
Parallelizing a loop for MIMD machines can be described as a process of partitioning it into a number of relatively independent subloops. Previous approaches to partitioning non-vectorizable loops were mainly based on iteration pipelining, which partitioned a loop based on iteration number and exploited parallelism by overlapping the execution of iterations. However, the amount of parallelism exploited this way is limited because the parallelism inside iterations has been ignored. In this paper, we present a new loop partitioning technique which can exploit both forms of parallelism: inside and across iterations. While inspired by the VLIW approach, our method is designed for more general, asynchronous, MIMD machines. In particular, our schedule takes the cost of communication into account, and attempts to balance it with respect to parallelism. We show our method is correct, efficient, and produces better schedules than previous iteration-level approaches.
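The iteration-number-based partitioning the paper contrasts itself with can be sketched as a simple block distribution of n iterations over p processors. This helper is illustrative only, not the paper's scheduling algorithm:

```java
// Iteration-level partitioning: split an n-iteration loop into p contiguous
// subloops, one per processor, balancing the leftover n % p iterations over
// the first processors. Returns half-open ranges [start, end).
public class LoopPartition {
    static int[][] partition(int n, int p) {
        int[][] ranges = new int[p][2];
        int base = n / p, rem = n % p, start = 0;
        for (int k = 0; k < p; k++) {
            int len = base + (k < rem ? 1 : 0); // first rem blocks get one extra
            ranges[k][0] = start;
            ranges[k][1] = start + len;
            start += len;
        }
        return ranges;
    }
}
```

Such a split exposes only the parallelism across iterations; the paper's point is that any parallelism among the statements within one iteration is invisible at this granularity.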
Fine-grain loop scheduling for MIMD machines
Previous algorithms for parallelizing loops on MIMD machines have been based on assigning one or more loop iterations to each processor, introducing synchronization as required. These methods exploit only iteration-level parallelism, and ignore the parallelism that may exist at a lower level. In order to exploit parallelism both within and across iterations, our algorithm analyzes and schedules the loop at the statement level. The loop schedule reflects the expected communication and synchronization costs of the target machine. We provide test results that show that this algorithm can produce good speedup of loops on an MIMD machine.
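A minimal sketch of what statement-level scheduling means, assuming a loop body with two statements that touch disjoint arrays: each statement stream can run on its own processor, here modelled with plain Java threads. The structure and names are illustrative, not the paper's scheduler:

```java
// Fine-grain (statement-level) view of a loop body with two independent
// statements, S1 and S2. Because S1 writes only x and S2 writes only y,
// the two statement streams can execute concurrently even though each is
// itself a sequential iteration sequence.
public class FineGrain {
    static int[] x, y;

    static void run(int n) {
        x = new int[n];
        y = new int[n];
        Thread t1 = new Thread(() -> { // S1: x[i] = i * i
            for (int i = 0; i < n; i++) x[i] = i * i;
        });
        Thread t2 = new Thread(() -> { // S2: y[i] = i + 1
            for (int i = 0; i < n; i++) y[i] = i + 1;
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

On a real asynchronous MIMD machine the scheduler would also weigh the communication cost of placing S1 and S2 on different processors, which is the trade-off the abstract describes.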