Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis
Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively, by exploiting commodity manycore technology including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations or restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers.
Despite being a dynamic language, we show that Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders-of-magnitude speedups over baseline interpreted execution, and speedups of up to 50x (though not consistent across benchmarks) over CPU JIT-compiled execution, on 12 loop-intensive standard benchmarks.
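The staged analysis described here can be sketched in plain Python (a minimal illustration, not the paper's actual implementation): a static pass over the AST keeps whatever loop bounds are literal constants, and defers anything symbolic to runtime introspection.

```python
import ast

def loop_bounds_static(src):
    """Stage 1: try to extract constant bounds of a `for i in range(...)`
    loop statically; return None when a bound is symbolic, signalling
    that it must be resolved just-in-time from the live frame."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.For) and isinstance(node.iter, ast.Call):
            args = node.iter.args
            if all(isinstance(a, ast.Constant) for a in args):
                vals = [a.value for a in args]
                return (0, vals[0]) if len(vals) == 1 else (vals[0], vals[1])
            return None  # symbolic bound: defer to the runtime stage
    return None

# Literal bound: resolved statically
print(loop_bounds_static("for i in range(8): a[i] = b[i]"))  # (0, 8)
# Symbolic bound: deferred to runtime introspection
print(loop_bounds_static("for i in range(n): a[i] = b[i]"))  # None
```

In the second case the real system would read `n` and the array types out of the running interpreter before specializing the loop.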
Can We Run in Parallel? Automating Loop Parallelization for TornadoVM
With the advent of multi-core systems, GPUs and FPGAs, loop parallelization
has become a promising way to speed up program execution. To keep pace with
this trend, various performance-oriented programming languages provide a
multitude of constructs to allow programmers to write parallelizable loops.
Correspondingly, researchers have developed techniques to automatically
parallelize loops that do not carry dependences across iterations, and/or call
pure functions. However, in managed languages with platform-independent
runtimes such as Java, it is practically infeasible to perform complex
dependence analysis during JIT compilation. In this paper, we propose
AutoTornado, a first-of-its-kind static+JIT loop parallelizer for Java programs
that parallelizes loops for heterogeneous architectures using TornadoVM (a
Graal-based VM that supports insertion of @Parallel constructs for loop
parallelization).
AutoTornado performs sophisticated dependence and purity analysis of Java
programs statically, in the Soot framework, to generate constraints encoding
conditions under which a given loop can be parallelized. The generated
constraints are then fed to the Z3 theorem prover (which we have integrated
with Soot) to annotate canonical for loops that can be parallelized using the
@Parallel construct. We have also added runtime support in TornadoVM to use
static analysis results for loop parallelization. Our evaluation over several
standard parallelization kernels shows that AutoTornado correctly parallelizes
61.3% of manually parallelizable loops, with an efficient static analysis and a
near-zero runtime overhead. To the best of our knowledge, AutoTornado is not
only the first tool that performs program-analysis based parallelization for a
real-world JVM, but also the first to integrate Z3 with Soot for loop
parallelization.
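The kind of condition such a parallelizer must discharge can be illustrated with a classical GCD dependence test. The sketch below is a hand-rolled necessary-condition check in Python, standing in for the Z3 constraint encoding (which the abstract does not detail):

```python
from math import gcd

def may_depend(a, b, c):
    """GCD dependence test: a write to x[a*i1 + c] and a read of x[b*i2]
    can touch the same element only if a*i1 + c == b*i2 has an integer
    solution, i.e. only if gcd(a, b) divides c.  This is a necessary
    condition, so False proves independence while True is conservative."""
    return c % gcd(a, b) == 0

# x[2*i] = x[2*i + 1]: even indices written, odd indices read -> independent
print(may_depend(2, 2, 1))   # False -> safe to parallelize
# x[i] = x[i + 1]: classic loop-carried dependence
print(may_depend(1, 1, 1))   # True  -> keep the loop serial
```

A solver-based approach generalizes this: the same equality, plus the loop-bound inequalities, is handed to Z3, which either proves unsatisfiability (parallel) or finds a witness iteration pair (serial).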
ALPyNA: Acceleration of Loops in Python for Novel Architectures
We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this is combined with type and bounds information discovered at runtime, to auto-generate high-performance kernels for offload to GPU.
We demonstrate speedups of up to 1000x relative to the native CPython interpreter across four array-intensive numerical Python benchmarks. Performance improvement is related to both iteration domain size and dependence graph complexity. Nevertheless, this approach promises to bring the benefits of manycore parallelism to application developers.
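A minimal sketch of the runtime code-generation step, assuming the analysis has already proven the loop parallel: type and bound information gathered by introspection is spliced into a CUDA kernel template (illustrative only; the function name and template are assumptions, not ALPyNA's actual generator):

```python
def make_cuda_kernel(name, dtype, n):
    """Emit CUDA C source for a 1-D elementwise loop proven parallel.
    `dtype` and `n` come from runtime introspection of the live arrays."""
    ctype = {"float64": "double", "float32": "float", "int64": "long long"}[dtype]
    return (
        f"__global__ void {name}({ctype}* a, const {ctype}* b) {{\n"
        f"    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"    if (i < {n}) a[i] = b[i] * b[i];\n"  # one iteration per GPU thread
        f"}}\n"
    )

print(make_cuda_kernel("square", "float64", 1024))
```

The generated string would then be compiled and launched through a CUDA driver binding; the original serial loop remains the fallback when bounds or types change.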
Kranc: a Mathematica application to generate numerical codes for tensorial evolution equations
We present a suite of Mathematica-based computer-algebra packages, termed
"Kranc", which comprise a toolbox to convert (tensorial) systems of partial
differential evolution equations to parallelized C or Fortran code. Kranc can
be used as a "rapid prototyping" system for physicists or mathematicians
handling very complicated systems of partial differential equations, but
through integration into the Cactus computational toolkit we can also produce
efficient parallelized production codes. Our work is motivated by the field of
numerical relativity, where Kranc is used as a research tool by the authors. In
this paper we describe the design and implementation of both the Mathematica
packages and the resulting code, we discuss some example applications, and
provide results on the performance of an example numerical code for the
Einstein equations.

Comment: 24 pages, 1 figure. Corresponds to journal version.
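The equation-to-code idea can be sketched with a toy generator that emits the C inner loop for an explicit finite-difference step of the heat equation u_t = c·u_xx from its coefficients (an illustration of the general technique only, not Kranc's actual output):

```python
def fd_update_c(coef, dt, h):
    """Emit C for one explicit Euler step of u_t = coef * u_xx using a
    second-order central difference.  Variable names (u, u_new, n) are
    illustrative placeholders for the generated code's conventions."""
    return (
        "for (int i = 1; i < n - 1; i++) {\n"
        f"    u_new[i] = u[i] + {dt} * {coef} * "
        f"(u[i+1] - 2.0*u[i] + u[i-1]) / ({h}*{h});\n"
        "}\n"
    )

print(fd_update_c(0.5, 0.01, 0.1))
```

A full system like Kranc does the same transformation for tensorial systems: it expands index notation symbolically, then prints the resulting scalar updates as loop bodies that a framework such as Cactus parallelizes.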
Compiler and runtime support for shared memory parallelization of data mining algorithms
Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for developing scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets. In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using apriori association mining and k-means clustering.
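The middleware style described here can be sketched as a reduction object: each thread accumulates into a private copy, and the copies are merged after the parallel phase. The class and method names below are hypothetical, not the paper's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

class ReductionObject:
    """Per-thread accumulator in the style of a shared-memory mining
    middleware: private copies avoid locking on every update."""
    def __init__(self, k):
        self.counts = [0] * k
    def accumulate(self, item):
        self.counts[item % len(self.counts)] += 1
    def merge(self, other):
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]

def parallel_count(data, k, nthreads=4):
    """Split `data` across threads, reduce locally, then merge."""
    chunks = [data[i::nthreads] for i in range(nthreads)]
    locals_ = [ReductionObject(k) for _ in range(nthreads)]
    def work(ro, chunk):
        for x in chunk:
            ro.accumulate(x)
    with ThreadPoolExecutor(nthreads) as ex:
        list(ex.map(work, locals_, chunks))
    result = ReductionObject(k)
    for ro in locals_:
        result.merge(ro)
    return result.counts

print(parallel_count(range(100), 4))  # [25, 25, 25, 25]
```

The same accumulate/merge pattern covers candidate counting in apriori and centroid sums in k-means, which is why one interface serves several mining algorithms.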
Improving the Performance of a Pointer-Based, Speculative Parallelization Scheme
Speculative parallelization is a technique that attempts to extract parallelism from loops that cannot be parallelized at compile time. The underlying idea is to execute the code optimistically while a runtime subsystem checks that sequential semantics are not violated. There has been much work in this field, but we are not aware of any scheme able to parallelize applications that use pointer arithmetic. In earlier work by this author, a software library was developed to support such applications. However, it suffered from a serious limitation: the parallel versions ran slower than the sequential ones. This Master's thesis addresses that limitation, identifying and correcting the causes of the inefficiency, and putting the work in perspective among the worldwide contributions in this area. Experimental results with real applications show that these limitations have been overcome: we obtain speedups of up to 1.61x, and the new version of the library improves on the execution times of the original speculative library by up to 421.4%.

Computer Science. Master's in Research in Information and Communication Technologies.
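The optimistic-execution-plus-check scheme behind speculative parallelization can be modeled in a few lines of Python: chunks run against a snapshot, and a chunk that read a location written by a logically earlier chunk is squashed and re-executed on the committed state. This is a toy model of the general technique, not the library described above (which additionally handles pointer-based accesses):

```python
def run_speculative(a, body, nchunks=4):
    """Run `body(arr, i, reads, writes)` over all i optimistically in
    chunks; commit chunks in sequential order, squashing and re-running
    any chunk whose reads overlap earlier chunks' committed writes."""
    n = len(a)
    bounds = [(c * n // nchunks, (c + 1) * n // nchunks) for c in range(nchunks)]
    snapshot, results = list(a), []
    for lo, hi in bounds:                 # optimistic phase (conceptually parallel)
        shadow, reads, writes = list(snapshot), set(), set()
        for i in range(lo, hi):
            body(shadow, i, reads, writes)
        results.append((lo, hi, shadow, reads, writes))
    committed, done_writes, squashes = list(snapshot), set(), 0
    for lo, hi, shadow, reads, writes in results:   # commit in order
        if reads & done_writes:           # stale read detected: squash and redo
            squashes += 1
            shadow, reads, writes = list(committed), set(), set()
            for i in range(lo, hi):
                body(shadow, i, reads, writes)
        for j in writes:
            committed[j] = shadow[j]
        done_writes |= writes
    return committed, squashes

def body(arr, i, reads, writes):
    """Loop body a[i] = a[i-1] + 1, instrumented to record its accesses."""
    if i > 0:
        reads.add(i - 1)
        arr[i] = arr[i - 1] + 1
        writes.add(i)

print(run_speculative([0] * 8, body, nchunks=2))  # ([0, 1, 2, 3, 4, 5, 6, 7], 1)
```

The second chunk reads index 3, which the first chunk wrote, so it is squashed once and re-executed; the final state matches sequential execution. Driving squash rates (and monitoring overhead) down is exactly the performance problem the thesis addresses.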