    Python Programmers Have GPUs Too: Automatic Python Loop Parallelization with Staged Dependence Analysis

    Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively by exploiting commodity manycore technology, including GPUs. However, existing approaches to parallelism in Python are esoteric and generally too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations, and restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers. Despite Python being a dynamic language, we show that it is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders-of-magnitude speedup over baseline interpreted execution and some speedup (up to 50x, although not consistently) over CPU JIT-compiled execution, across 12 loop-intensive standard benchmarks.
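    The distinction such a dependence analysis draws can be illustrated with a small, hypothetical Python example (not taken from the paper): a loop whose iterations are independent can be mapped to one GPU thread per iteration, while a loop-carried dependence forces sequential execution.

        import numpy as np

        a = np.zeros(1024)
        b = np.arange(1024.0)

        # Parallelizable: iteration i touches only a[i] and b[i], so no
        # value flows between iterations and each can run on its own
        # GPU thread.
        for i in range(1024):
            a[i] = 2.0 * b[i]

        # Not parallelizable as written: iteration i reads a[i - 1],
        # which iteration i - 1 wrote, a loop-carried (flow) dependence.
        for i in range(1, 1024):
            a[i] = a[i - 1] + b[i]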

    Can We Run in Parallel? Automating Loop Parallelization for TornadoVM

    With the advent of multi-core systems, GPUs and FPGAs, loop parallelization has become a promising way to speed up program execution. To keep pace, various performance-oriented programming languages provide a multitude of constructs that allow programmers to write parallelizable loops. Correspondingly, researchers have developed techniques to automatically parallelize loops that do not carry dependences across iterations, and/or call pure functions. However, in managed languages with platform-independent runtimes such as Java, it is practically infeasible to perform complex dependence analysis during JIT compilation. In this paper, we propose AutoTornado, a first-of-its-kind static+JIT loop parallelizer for Java programs that parallelizes loops for heterogeneous architectures using TornadoVM (a Graal-based VM that supports insertion of @Parallel constructs for loop parallelization). AutoTornado performs sophisticated dependence and purity analysis of Java programs statically, in the Soot framework, to generate constraints encoding the conditions under which a given loop can be parallelized. The generated constraints are then fed to the Z3 theorem prover (which we have integrated with Soot) to annotate canonical for loops that can be parallelized using the @Parallel construct. We have also added runtime support in TornadoVM to use static analysis results for loop parallelization. Our evaluation over several standard parallelization kernels shows that AutoTornado correctly parallelizes 61.3% of manually parallelizable loops, with an efficient static analysis and near-zero runtime overhead. To the best of our knowledge, AutoTornado is not only the first tool that performs program-analysis-based parallelization for a real-world JVM, but also the first to integrate Z3 with Soot for loop parallelization.
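    A flavour of this constraint-based dependence testing can be sketched with Z3's Python bindings (the z3-solver package); the loop and encoding below are a toy illustration, not AutoTornado's actual constraint generator, which works on Soot's analysis of Java bytecode.

        from z3 import Ints, Solver, sat

        # Toy dependence query: can two distinct iterations i, j of
        #     for i in range(n): a[2*i + 1] = a[2*i]
        # conflict, i.e. can the write of iteration i alias the read of
        # iteration j?  (A full test would also check write-write pairs.)
        i, j, n = Ints("i j n")
        s = Solver()
        s.add(0 <= i, i < n, 0 <= j, j < n, i != j)
        s.add(2 * i + 1 == 2 * j)
        print(s.check() == sat)  # False: odd never equals even, so the
                                 # loop carries no dependence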

    ALPyNA: Acceleration of Loops in Python for Novel Architectures

    We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (where possible) or at runtime; this is combined with type and bounds information discovered at runtime to auto-generate high-performance kernels for offload to the GPU. We demonstrate speedups of up to 1000x relative to the native CPython interpreter across four array-intensive numerical Python benchmarks. Performance improvement is related to both iteration domain size and dependence graph complexity. Nevertheless, this approach promises to bring the benefits of manycore parallelism to application developers.
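    For a rough sense of the target such a tool compiles to, the GPU kernel for an independent Python loop can be written by hand with Numba's CUDA backend; Numba here is a stand-in for illustration only, since ALPyNA generates its own CUDA kernels.

        import numpy as np
        from numba import cuda

        @cuda.jit
        def saxpy(a, x, y, out):
            i = cuda.grid(1)          # one GPU thread per loop iteration
            if i < out.size:          # guard against excess threads
                out[i] = a * x[i] + y[i]

        n = 1 << 20
        x = np.ones(n, dtype=np.float32)
        y = np.ones(n, dtype=np.float32)
        out = np.empty_like(x)
        threads = 256
        blocks = (n + threads - 1) // threads
        saxpy[blocks, threads](np.float32(2.0), x, y, out)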

    Kranc: a Mathematica application to generate numerical codes for tensorial evolution equations

    We present a suite of Mathematica-based computer-algebra packages, termed "Kranc", which comprise a toolbox to convert (tensorial) systems of partial differential evolution equations to parallelized C or Fortran code. Kranc can be used as a "rapid prototyping" system for physicists or mathematicians handling very complicated systems of partial differential equations, but through integration into the Cactus computational toolkit we can also produce efficient parallelized production codes. Our work is motivated by the field of numerical relativity, where Kranc is used as a research tool by the authors. In this paper we describe the design and implementation of both the Mathematica packages and the resulting code, discuss some example applications, and provide results on the performance of an example numerical code for the Einstein equations. (24 pages, 1 figure; corresponds to the journal version.)
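    To make concrete what "converting an evolution equation to code" means, here is a hand-written Python sketch (not Kranc output, which is C or Fortran, and far simpler than a tensorial system) of the kind of finite-difference update loop such generated code performs, for the 1D wave equation u_tt = u_xx with a leapfrog time-stepper.

        import numpy as np

        nx, dt, dx = 200, 0.005, 0.01
        x = np.linspace(0.0, 2.0, nx)
        u_old = np.exp(-((x - 1.0) ** 2) / 0.01)   # Gaussian initial pulse
        u = u_old.copy()                           # zero initial velocity
        c2 = (dt / dx) ** 2                        # CFL factor, here 0.25

        for _ in range(1000):                      # time-stepping loop
            u_new = u.copy()
            # second-order centred difference for the spatial derivative
            u_new[1:-1] = (2 * u[1:-1] - u_old[1:-1]
                           + c2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
            u_old, u = u, u_new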

    Compiler and runtime support for shared memory parallelization of data mining algorithms

    Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for developing scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets. In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data-parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using Apriori association mining and k-means clustering.
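    The local-reduction pattern this line of work generalizes can be sketched for the k-means assignment step: each worker computes per-cluster partial sums over its chunk of the data, and the partials are then combined. The Python below is a minimal, hypothetical illustration (the actual system is compiler-generated code over a middleware), using processes rather than threads to sidestep the CPython GIL.

        import numpy as np
        from concurrent.futures import ProcessPoolExecutor

        def local_reduce(args):
            """Assign each point in one chunk to its nearest centroid and
            return per-cluster partial sums and counts."""
            points, centers = args
            d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            k, dim = centers.shape
            sums, counts = np.zeros((k, dim)), np.zeros(k)
            for c in range(k):
                mask = labels == c
                sums[c] = points[mask].sum(axis=0)
                counts[c] = mask.sum()
            return sums, counts

        def kmeans_step(points, centers, workers=4):
            chunks = np.array_split(points, workers)
            with ProcessPoolExecutor(workers) as ex:
                parts = list(ex.map(local_reduce,
                                    [(c, centers) for c in chunks]))
            sums = sum(p[0] for p in parts)      # combine local reductions
            counts = sum(p[1] for p in parts)
            return sums / np.maximum(counts, 1)[:, None]  # new centroids

        if __name__ == "__main__":    # guard required for process pools
            rng = np.random.default_rng(0)
            pts = rng.normal(size=(10_000, 2))
            print(kmeans_step(pts, rng.normal(size=(4, 2))))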

    Autotuning for Automatic Parallelization on Heterogeneous Systems


    Improving the Performance of a Pointer-Based, Speculative Parallelization Scheme

    Speculative parallelization is a technique that attempts to extract parallelism from loops that cannot be parallelized at compile time. The underlying idea is to execute the code optimistically while a subsystem checks that sequential semantics are not violated. Much work has been done in this field; however, we are not aware of any prior scheme able to parallelize applications that use pointer arithmetic. In previous work by the author of this thesis, a software library was developed to support this kind of application. Nevertheless, that software suffered from a significant limitation: the execution time of the parallel versions exceeded that of the sequential versions. This Master's thesis addresses that limitation, finding and correcting the causes of the inefficiency and putting the work in perspective within the worldwide contributions to this field. Experimental results obtained with real applications allow us to state that these limitations have been overcome, as we obtain speedups of up to 1.61x. With the new version of the library, improvements of up to 421.4% over the execution time of the original speculative library have been achieved.
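    The speculate-and-check cycle described above can be sketched in a few lines of Python; this is a toy illustration only (the thesis concerns a C library handling pointer arithmetic), and a real scheme checkpoints state so it can roll back after a misspeculation rather than merely report it.

        from concurrent.futures import ThreadPoolExecutor

        def run_speculatively(body, n, state):
            # Optimistic phase: run all iterations in parallel, each one
            # logging the indices of `state` it reads and writes.
            logs = [None] * n

            def spec(i):
                reads, writes = set(), set()
                body(i, state, reads, writes)
                logs[i] = (reads, writes)

            with ThreadPoolExecutor() as ex:
                list(ex.map(spec, range(n)))

            # Check phase: sequential semantics are violated whenever one
            # iteration wrote a location another iteration touched.
            for i in range(n):
                for j in range(i + 1, n):
                    ri, wi = logs[i]
                    rj, wj = logs[j]
                    if wi & (rj | wj) or wj & ri:
                        raise RuntimeError(
                            f"misspeculation between iterations {i} and {j}:"
                            " restore the checkpoint and re-run sequentially")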