Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis
Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively, by exploiting commodity manycore technology including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations or restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers.
Despite being a dynamic language, we show that Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders-of-magnitude speedups over baseline interpreted execution, and speedups of up to 50x (though not consistent across benchmarks) over CPU JIT-compiled execution, on 12 loop-intensive standard benchmarks.
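The staged analysis described here can be sketched in plain Python (a minimal illustration, not the paper's actual implementation): a static pass over the AST keeps whatever loop bounds are literal constants, and defers anything symbolic to runtime introspection.

```python
import ast

def loop_bounds_static(src):
    """Stage 1: try to extract constant bounds of a `for i in range(...)`
    loop statically; return None when a bound is symbolic, signalling
    that it must be resolved just-in-time from the live frame."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.For) and isinstance(node.iter, ast.Call):
            args = node.iter.args
            if all(isinstance(a, ast.Constant) for a in args):
                vals = [a.value for a in args]
                return (0, vals[0]) if len(vals) == 1 else (vals[0], vals[1])
            return None  # symbolic bound: defer to the runtime stage
    return None

# Literal bound: resolved statically
print(loop_bounds_static("for i in range(8): a[i] = b[i]"))  # (0, 8)
# Symbolic bound: deferred to runtime introspection
print(loop_bounds_static("for i in range(n): a[i] = b[i]"))  # None
```

In the second case the real system would read `n` and the array types out of the running interpreter before specializing the loop.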
Can We Run in Parallel? Automating Loop Parallelization for TornadoVM
With the advent of multi-core systems, GPUs and FPGAs, loop parallelization
has become a promising way to speed up program execution. To keep pace with
this trend, various performance-oriented programming languages provide a
multitude of constructs to allow programmers to write parallelizable loops.
Correspondingly, researchers have developed techniques to automatically
parallelize loops that do not carry dependences across iterations, and/or call
pure functions. However, in managed languages with platform-independent
runtimes such as Java, it is practically infeasible to perform complex
dependence analysis during JIT compilation. In this paper, we propose
AutoTornado, a first-of-its-kind static+JIT loop parallelizer for Java programs
that parallelizes loops for heterogeneous architectures using TornadoVM (a
Graal-based VM that supports insertion of @Parallel constructs for loop
parallelization).
AutoTornado performs sophisticated dependence and purity analysis of Java
programs statically, in the Soot framework, to generate constraints encoding
conditions under which a given loop can be parallelized. The generated
constraints are then fed to the Z3 theorem prover (which we have integrated
with Soot) to annotate canonical for loops that can be parallelized using the
@Parallel construct. We have also added runtime support in TornadoVM to use
static analysis results for loop parallelization. Our evaluation over several
standard parallelization kernels shows that AutoTornado correctly parallelizes
61.3% of manually parallelizable loops, with an efficient static analysis and a
near-zero runtime overhead. To the best of our knowledge, AutoTornado is not
only the first tool that performs program-analysis based parallelization for a
real-world JVM, but also the first to integrate Z3 with Soot for loop
parallelization.
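The kind of condition such a parallelizer must discharge can be illustrated with a classical GCD dependence test. The sketch below is a hand-rolled necessary-condition check in Python, standing in for the Z3 constraint encoding (which the abstract does not detail):

```python
from math import gcd

def may_depend(a, b, c):
    """GCD dependence test: a write to x[a*i1 + c] and a read of x[b*i2]
    can touch the same element only if a*i1 + c == b*i2 has an integer
    solution, i.e. only if gcd(a, b) divides c.  This is a necessary
    condition, so False proves independence while True is conservative."""
    return c % gcd(a, b) == 0

# x[2*i] = x[2*i + 1]: even indices written, odd indices read -> independent
print(may_depend(2, 2, 1))   # False -> safe to parallelize
# x[i] = x[i + 1]: classic loop-carried dependence
print(may_depend(1, 1, 1))   # True  -> keep the loop serial
```

A solver-based approach generalizes this: the same equality, plus the loop-bound inequalities, is handed to Z3, which either proves unsatisfiability (parallel) or finds a witness iteration pair (serial).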
ALPyNA: Acceleration of Loops in Python for Novel Architectures
We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this is combined with type and bounds information discovered at runtime, to auto-generate high-performance kernels for offload to GPU.
We demonstrate speedups of up to 1000x relative to the native CPython interpreter across four array-intensive numerical Python benchmarks. Performance improvement is related to both iteration domain size and dependence graph complexity. Nevertheless, this approach promises to bring the benefits of manycore parallelism to application developers.
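A minimal sketch of the runtime code-generation step, assuming the analysis has already proven the loop parallel: type and bound information gathered by introspection is spliced into a CUDA kernel template (illustrative only; the function name and template are assumptions, not ALPyNA's actual generator):

```python
def make_cuda_kernel(name, dtype, n):
    """Emit CUDA C source for a 1-D elementwise loop proven parallel.
    `dtype` and `n` come from runtime introspection of the live arrays."""
    ctype = {"float64": "double", "float32": "float", "int64": "long long"}[dtype]
    return (
        f"__global__ void {name}({ctype}* a, const {ctype}* b) {{\n"
        f"    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"    if (i < {n}) a[i] = b[i] * b[i];\n"  # one iteration per GPU thread
        f"}}\n"
    )

print(make_cuda_kernel("square", "float64", 1024))
```

The generated string would then be compiled and launched through a CUDA driver binding; the original serial loop remains the fallback when bounds or types change.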
Kranc: a Mathematica application to generate numerical codes for tensorial evolution equations
We present a suite of Mathematica-based computer-algebra packages, termed
"Kranc", which comprise a toolbox to convert (tensorial) systems of partial
differential evolution equations to parallelized C or Fortran code. Kranc can
be used as a "rapid prototyping" system for physicists or mathematicians
handling very complicated systems of partial differential equations, but
through integration into the Cactus computational toolkit we can also produce
efficient parallelized production codes. Our work is motivated by the field of
numerical relativity, where Kranc is used as a research tool by the authors. In
this paper we describe the design and implementation of both the Mathematica
packages and the resulting code, we discuss some example applications, and
provide results on the performance of an example numerical code for the
Einstein equations.

Comment: 24 pages, 1 figure. Corresponds to journal version.
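The equation-to-code idea can be sketched with a toy generator that emits the C inner loop for an explicit finite-difference step of the heat equation u_t = c·u_xx from its coefficients (an illustration of the general technique only, not Kranc's actual output):

```python
def fd_update_c(coef, dt, h):
    """Emit C for one explicit Euler step of u_t = coef * u_xx using a
    second-order central difference.  Variable names (u, u_new, n) are
    illustrative placeholders for the generated code's conventions."""
    return (
        "for (int i = 1; i < n - 1; i++) {\n"
        f"    u_new[i] = u[i] + {dt} * {coef} * "
        f"(u[i+1] - 2.0*u[i] + u[i-1]) / ({h}*{h});\n"
        "}\n"
    )

print(fd_update_c(0.5, 0.01, 0.1))
```

A full system like Kranc does the same transformation for tensorial systems: it expands index notation symbolically, then prints the resulting scalar updates as loop bodies that a framework such as Cactus parallelizes.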
Compiler and runtime support for shared memory parallelization of data mining algorithms
Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for developing scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets. In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using apriori association mining and k-means clustering.
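The middleware style described here can be sketched as a reduction object: each thread accumulates into a private copy, and the copies are merged after the parallel phase. The class and method names below are hypothetical, not the paper's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

class ReductionObject:
    """Per-thread accumulator in the style of a shared-memory mining
    middleware: private copies avoid locking on every update."""
    def __init__(self, k):
        self.counts = [0] * k
    def accumulate(self, item):
        self.counts[item % len(self.counts)] += 1
    def merge(self, other):
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]

def parallel_count(data, k, nthreads=4):
    """Split `data` across threads, reduce locally, then merge."""
    chunks = [data[i::nthreads] for i in range(nthreads)]
    locals_ = [ReductionObject(k) for _ in range(nthreads)]
    def work(ro, chunk):
        for x in chunk:
            ro.accumulate(x)
    with ThreadPoolExecutor(nthreads) as ex:
        list(ex.map(work, locals_, chunks))
    result = ReductionObject(k)
    for ro in locals_:
        result.merge(ro)
    return result.counts

print(parallel_count(range(100), 4))  # [25, 25, 25, 25]
```

The same accumulate/merge pattern covers candidate counting in apriori and centroid sums in k-means, which is why one interface serves several mining algorithms.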
Improving the Performance of a Pointer-Based, Speculative Parallelization Scheme
Speculative parallelization is a technique that attempts to extract parallelism from loops that cannot be parallelized at compile time. The underlying idea is to execute the code optimistically while a runtime subsystem checks that sequential semantics are not violated. There has been much work in this field, but we are not aware of any scheme able to parallelize applications that use pointer arithmetic. In earlier work by this author, a software library was developed to support such applications. However, it suffered from a serious limitation: the parallel versions ran slower than the sequential ones. This Master's thesis addresses that limitation, identifying and correcting the causes of the inefficiency, and putting the work in perspective among the worldwide contributions in this area. Experimental results with real applications show that these limitations have been overcome: we obtain speedups of up to 1.61x, and the new version of the library improves on the execution times of the original speculative library by up to 421.4%.

Computer Science. Master's in Research in Information and Communication Technologies.
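The optimistic-execution-plus-check scheme behind speculative parallelization can be modeled in a few lines of Python: chunks run against a snapshot, and a chunk that read a location written by a logically earlier chunk is squashed and re-executed on the committed state. This is a toy model of the general technique, not the library described above (which additionally handles pointer-based accesses):

```python
def run_speculative(a, body, nchunks=4):
    """Run `body(arr, i, reads, writes)` over all i optimistically in
    chunks; commit chunks in sequential order, squashing and re-running
    any chunk whose reads overlap earlier chunks' committed writes."""
    n = len(a)
    bounds = [(c * n // nchunks, (c + 1) * n // nchunks) for c in range(nchunks)]
    snapshot, results = list(a), []
    for lo, hi in bounds:                 # optimistic phase (conceptually parallel)
        shadow, reads, writes = list(snapshot), set(), set()
        for i in range(lo, hi):
            body(shadow, i, reads, writes)
        results.append((lo, hi, shadow, reads, writes))
    committed, done_writes, squashes = list(snapshot), set(), 0
    for lo, hi, shadow, reads, writes in results:   # commit in order
        if reads & done_writes:           # stale read detected: squash and redo
            squashes += 1
            shadow, reads, writes = list(committed), set(), set()
            for i in range(lo, hi):
                body(shadow, i, reads, writes)
        for j in writes:
            committed[j] = shadow[j]
        done_writes |= writes
    return committed, squashes

def body(arr, i, reads, writes):
    """Loop body a[i] = a[i-1] + 1, instrumented to record its accesses."""
    if i > 0:
        reads.add(i - 1)
        arr[i] = arr[i - 1] + 1
        writes.add(i)

print(run_speculative([0] * 8, body, nchunks=2))  # ([0, 1, 2, 3, 4, 5, 6, 7], 1)
```

The second chunk reads index 3, which the first chunk wrote, so it is squashed once and re-executed; the final state matches sequential execution. Driving squash rates (and monitoring overhead) down is exactly the performance problem the thesis addresses.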