Boosting Java Performance Using GPGPUs

Authors: Brown Gavin; Clarkson James; Kotselidis Christos; Luján Mikel
Publication venue: Springer Science and Business Media LLC
Publication date: 04/03/2017

Crossref

The University of Manchester - Institutional Repository

ALPyNA: Acceleration of Loops in Python for Novel Architectures

Authors: Jacob Dejice; Singer Jeremy
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2019

We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this is combined with type and bounds information discovered at runtime, to auto-generate high-performance kernels for offload to GPU. We demonstrate speedups of up to 1000x relative to the native CPython interpreter across four array-intensive numerical Python benchmarks. Performance improvement is related to both iteration domain size and dependence graph complexity. Nevertheless, this approach promises to bring the benefits of manycore parallelism to application developers.
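
As an illustration of what "classical dependence analysis" can look like, here is a minimal sketch of the GCD test, one well-known textbook technique. The abstract does not say which tests ALPyNA applies, so the choice of test and the function name `may_depend` are assumptions made for this example only. The test ignores loop bounds and is conservative: a `True` result means a dependence is possible, not that one exists.

```python
from math import gcd


def may_depend(write_coef, write_off, read_coef, read_off):
    """GCD test for a[write_coef*i + write_off] (written) and
    a[read_coef*j + read_off] (read), with integer i and j.

    Returns False only when the two subscripts can never be equal,
    so the loop iterations are safe to run in parallel with respect
    to this pair of accesses. (Hypothetical helper, for illustration.)
    """
    g = gcd(write_coef, read_coef)
    if g == 0:
        # Both subscripts are constants: they clash only if equal.
        return write_off == read_off
    # An integer solution exists only if g divides the offset difference.
    return (read_off - write_off) % g == 0


# for i in range(n): a[2*i] = a[2*i + 1] + 1   -> even vs odd indices
even_odd = may_depend(2, 0, 2, 1)

# for i in range(1, n): a[i] = a[i - 1] + 1    -> possible carried dependence
prefix = may_depend(1, 0, 1, -1)

print(even_odd, prefix)
```

The first loop writes only even elements and reads only odd ones, so the test rules out a dependence; the second loop reads the element written by the previous iteration, so the test cannot rule one out.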

Crossref

Enlighten

Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis

Authors: Jacob Dejice; Trinder Phil; Singer Jeremy
Publication date: 20/10/2019

Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively, by exploiting commodity manycore technology including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations or restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers. Despite being a dynamic language, we show that Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders of magnitude speedup over baseline interpreted execution and some speedup (up to 50x, although not consistently) over CPU JIT-compiled execution, across 12 loop-intensive standard benchmarks.
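
The staging idea in the abstract (analyze what can be known from source first, then resolve loop bounds and variable types from live values) can be sketched with the standard `ast` module. This is only an illustration under stated assumptions, not the authors' implementation: the sample function `scale` and the helpers `static_stage` and `runtime_stage` are hypothetical names, and the sketch stops before any dependence analysis or kernel generation.

```python
import ast

# Hypothetical loop nest to analyze; 'n' is unknown until call time.
SOURCE = """
def scale(a, b, n):
    for i in range(n):
        b[i] = 2 * a[i]
"""


def static_stage(source):
    """Stage 1 (before execution): find the symbolic names used in the
    range(...) bounds of for-loops. Their values are not yet known."""
    unresolved = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.For)
                and isinstance(node.iter, ast.Call)
                and isinstance(node.iter.func, ast.Name)
                and node.iter.func.id == "range"):
            for arg in node.iter.args:
                unresolved += [n.id for n in ast.walk(arg)
                               if isinstance(n, ast.Name)]
    return unresolved


def runtime_stage(unresolved, call_args):
    """Stage 2 (at call time): resolve the bounds left open by stage 1
    and record the concrete type of every argument."""
    bounds = {name: call_args[name] for name in unresolved}
    types = {name: type(value).__name__ for name, value in call_args.items()}
    return bounds, types


symbols = static_stage(SOURCE)
bounds, types = runtime_stage(
    symbols, {"a": [1.0, 2.0], "b": [0.0, 0.0], "n": 2}
)
print(symbols, bounds, types)
```

With concrete bounds and types in hand, a code generator would have the information it needs to specialize a kernel for this particular call; that step is outside the scope of this sketch.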

OPUS Augsburg

Enlighten

Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis

Authors: Abadi Martín; Beazley David; Pedregosa Fabian; Pope-Carter Finnegan; Rubinsteyn Alex
Publication venue: Association for Computing Machinery (ACM)
Publication date: 20/10/2019

Crossref

Enlighten