2 research outputs found

    Pricing Python Parallelism: A Dynamic Language Cost Model for Heterogeneous Platforms

    Execution times may be reduced by offloading parallel loop nests to a GPU. Auto-parallelizing compilers are common for static languages, often using a cost model to determine when the GPU execution speed will outweigh the offload overheads. Nowadays, scientific software is increasingly written in dynamic languages and would benefit from compute accelerators. The ALPyNA framework analyses moderately complex Python loop nests and automatically JIT compiles code for heterogeneous CPU and GPU architectures. We present the first analytical cost model for auto-parallelizing loop nests in a dynamic language on heterogeneous architectures. Predicting execution time in a language like Python is extremely challenging, since aspects like the element types, size of the iteration space, and amenability to parallelization can only be determined at runtime. Hence the cost model must be both staged, to combine compile-time and run-time information, and lightweight, to minimize runtime overhead. GPU execution time prediction must account for factors like data transfer, block-structured execution, and starvation. We show that a comparatively simple, staged analytical model can accurately determine during execution when it is profitable to offload a loop nest. We evaluate our model on three heterogeneous platforms across 360 experiments with 12 loop-intensive Python benchmark programs. The results show small misprediction intervals and a mean slowdown of just 13.6%, relative to the optimal (oracular) offload strategy.
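
    As a rough illustration of the kind of staged offload decision this abstract describes, the sketch below combines per-iteration estimates fixed ahead of time with values known only at run time (iteration count, bytes to transfer). It is not the ALPyNA model: every name, parameter, and formula here (`StaticInfo`, `Platform`, `should_offload`, the wave-based GPU estimate) is a hypothetical simplification chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticInfo:
    """Compile-time stage: assumed per-iteration costs in seconds (hypothetical)."""
    cpu_sec_per_iter: float
    gpu_sec_per_iter: float


@dataclass(frozen=True)
class Platform:
    """Assumed device parameters, imagined as measured once per machine."""
    bytes_per_sec: float      # host <-> device transfer bandwidth
    launch_overhead: float    # fixed kernel launch cost in seconds
    block_size: int           # threads per block
    concurrent_blocks: int    # blocks the GPU can execute at the same time


def predict_cpu(static: StaticInfo, iterations: int) -> float:
    """Sequential estimate: every iteration pays the CPU per-iteration cost."""
    return static.cpu_sec_per_iter * iterations


def predict_gpu(static: StaticInfo, platform: Platform,
                iterations: int, transfer_bytes: int) -> float:
    """GPU estimate: launch cost + data transfer + block-structured compute."""
    # Block-structured execution: iterations are rounded up to whole blocks,
    # and blocks are rounded up to whole "waves" of concurrently running blocks.
    blocks = math.ceil(iterations / platform.block_size)
    waves = math.ceil(blocks / platform.concurrent_blocks)
    # Starvation: a tiny iteration space still pays for one full wave.
    compute = waves * static.gpu_sec_per_iter
    transfer = transfer_bytes / platform.bytes_per_sec
    return platform.launch_overhead + transfer + compute


def should_offload(static: StaticInfo, platform: Platform,
                   iterations: int, transfer_bytes: int) -> bool:
    """Run-time stage: offload only if the predicted GPU time is lower."""
    gpu = predict_gpu(static, platform, iterations, transfer_bytes)
    return gpu < predict_cpu(static, iterations)


if __name__ == "__main__":
    static = StaticInfo(cpu_sec_per_iter=1e-6, gpu_sec_per_iter=1e-5)
    platform = Platform(bytes_per_sec=1e9, launch_overhead=1e-4,
                        block_size=256, concurrent_blocks=64)
    for n in (100, 1_000_000):
        print(n, should_offload(static, platform, n, transfer_bytes=8 * n))
```

    With these made-up numbers, a 100-iteration loop stays on the CPU because launch and transfer overheads dominate, while a million-iteration loop is predicted to be worth offloading.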

    Cache Size in a Cost Model for Heterogeneous Skeletons

    High-performance architectures are increasingly heterogeneous, with shared and distributed memory components. Programming such architectures is complicated, and performance portability is a major issue as the architectures evolve. This paper proposes a new architectural cost model that accounts for cache size and improves balance and performance on heterogeneous architectures, and demonstrates a skeleton-based programming model that simplifies programming heterogeneous architectures. We further demonstrate that the cost model can be exploited by skeletons to improve load balancing on heterogeneous architectures. The heterogeneous skeleton model facilitates performance portability, using the architectural cost model to automatically balance load across heterogeneous components of the architecture. For both a data-parallel benchmark and a realistic image processing program, we obtain good performance for the heterogeneous skeleton on homogeneous shared and distributed memory architectures, and on three heterogeneous architectures. We also show that taking cache size into account in the model leads to improved balance and performance.
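
    To make the idea of cost-model-driven load balancing concrete, here is a minimal sketch that splits a data-parallel workload across nodes in proportion to an estimated relative power. The weighting formula, the `cache_weight` constant, and the function names are assumptions made for this example; the abstract does not give the paper's actual cost model.

```python
def relative_power(cores: int, ghz: float, cache_mb: float,
                   cache_weight: float = 0.1) -> float:
    """Hypothetical node weight: raw compute scaled up by a cache-size term."""
    return cores * ghz * (1.0 + cache_weight * cache_mb)


def split_work(n_items: int, weights: list[float]) -> list[int]:
    """Divide n_items among workers in proportion to their weights."""
    total = sum(weights)
    # Round each proportional share down to a whole number of items.
    shares = [int(n_items * w / total) for w in weights]
    # Hand out the items lost to rounding, heaviest workers first.
    leftover = n_items - sum(shares)
    by_weight = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in by_weight[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Two nodes with identical cores and clock speed but different cache sizes.
    weights = [relative_power(4, 2.0, 0.0), relative_power(4, 2.0, 10.0)]
    print(split_work(90, weights))
```

    In this toy setup the node with the larger cache receives the larger chunk, which mirrors the abstract's claim that a cache-aware model changes how load is balanced; the size of that effect here is purely an artifact of the assumed `cache_weight`.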