
    Multi-criteria scheduling of pipeline workflows

    Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonistic criteria should be optimized, such as throughput and latency (or a combination of the two). In this paper, we study the complexity of the bi-criteria mapping problem for pipeline graphs on communication-homogeneous platforms. In particular, we assess the complexity of the well-known chains-to-chains problem for different-speed processors, which turns out to be NP-hard. We provide several efficient polynomial-time bi-criteria heuristics and evaluate their relative performance through extensive simulations.
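
    As a concrete illustration of the throughput criterion (a sketch of the homogeneous special case, not an algorithm from the paper): on identical processors, the period of an interval mapping is the largest interval load, and the smallest achievable period can be found in polynomial time by binary search over a greedy feasibility test. The stage weights below are made up for the example; the different-speed case studied in the paper is NP-hard and does not admit such a simple solution.

```python
# Illustrative sketch (not from the paper): minimizing the period (inverse
# throughput) of an interval mapping of a pipeline onto p identical processors.

def feasible(weights, p, period):
    """Greedily pack consecutive stages into intervals whose load stays
    within `period`; the mapping is feasible if at most p intervals suffice."""
    intervals, load = 1, 0.0
    for w in weights:
        if w > period:
            return False
        if load + w > period:
            intervals += 1
            load = 0.0
        load += w
    return intervals <= p

def min_period(weights, p, tol=1e-6):
    """Binary search on the period of the best interval mapping."""
    lo, hi = max(weights), float(sum(weights))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(weights, p, mid):
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    stages = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]  # hypothetical stage weights
    print(min_period(stages, p=3))                # ~11.0: bottleneck interval load
```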

    Distributed and parallel sparse convex optimization for radio interferometry with PURIFY

    Next-generation radio interferometric telescopes are entering an era of big data with extremely large data sets. While these telescopes can observe the sky with higher sensitivity and resolution than before, computational challenges in image reconstruction need to be overcome to realize their potential. New methods in sparse image reconstruction and convex optimization (cf. compressive sensing) have been shown to produce higher-fidelity reconstructions of simulations and real observations than traditional methods. This article presents distributed and parallel algorithms and implementations for sparse image reconstruction, with significant practical considerations that are important for applying these algorithms to big data. We benchmark the algorithms presented, showing that they are considerably faster than their serial equivalents. We then pre-sample gridding kernels to scale the distributed algorithms to larger data sizes, showing application times for 1 Gb to 2.4 Tb data sets over 25 to 100 nodes for up to 50 billion visibilities, and find that the run-times for the distributed algorithms range from 100 milliseconds to 3 minutes per iteration. This work is an important step towards the computationally scalable and efficient algorithms and implementations needed to image observations of both extended and compact sources from next-generation radio interferometers such as the SKA. The algorithms are implemented in the latest versions (3.1.0) of the SOPT (https://github.com/astro-informatics/sopt) and PURIFY (https://github.com/astro-informatics/purify) software packages, which have been released alongside this article.
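
    As a rough, non-authoritative illustration of the kind of sparse convex problem being solved (not the distributed PURIFY algorithms themselves), the sketch below runs plain iterative soft-thresholding (ISTA) on a toy l1-regularised least-squares problem; the dense measurement matrix, regularisation weight, and iteration count are assumptions made for the example.

```python
# Toy ISTA sketch: min_x (1/2) * ||y - A x||_2^2 + lam * ||x||_1.
# PURIFY's distributed algorithms use measurement operators built from
# gridding kernels and wavelet dictionaries rather than this dense A.
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Iterative soft-thresholding for l1-regularised least squares."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                 # gradient of the data-fidelity term
        x = soft_threshold(x - step * grad, step * lam)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 256))
    x_true = np.zeros(256)
    x_true[rng.choice(256, 8, replace=False)] = 1.0
    y = A @ x_true + 0.01 * rng.standard_normal(64)
    x_hat = ista(A, y, lam=0.1)
    print(np.count_nonzero(np.abs(x_hat) > 1e-2))  # number of sizeable coefficients recovered
```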

    Search-based Model-driven Loop Optimizations for Tensor Contractions

    Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. The Tensor Contraction Engine (TCE) is a high-level program synthesis system that facilitates the generation of high-performance parallel programs from tensor contraction equations. We are developing a new software infrastructure for the TCE that is designed to allow experimentation with optimization algorithms for modern computing platforms, including heterogeneous architectures employing general-purpose graphics processing units (GPGPUs). In this dissertation, we present improvements and extensions to the loop fusion optimization algorithm, which can be used with various cost models, e.g., for minimizing memory usage or for minimizing data movement costs under a memory constraint. We show that our data structure and pruning improvements to the loop fusion algorithm yield significant performance improvements that enable complex cost models to be used for large input equations. We also present an algorithm for optimizing the fused loop structure of handwritten code. It determines the regions in handwritten code that are safe to optimize and then runs the loop fusion algorithm on the dependency graph of the code. Finally, we develop an optimization framework for generating GPGPU code that combines loop fusion optimization with a novel cost model, tiling optimization, and layout optimization. Depending on the memory available on the GPGPU and the sizes of the tensors, our framework decides which processor (CPU or GPGPU) should perform an operation and where the result should be placed. We present extensive measurements for tuning the loop fusion algorithm, for validating our optimization framework, and for characterizing the performance of GPGPUs. Our measurements demonstrate that our optimization framework outperforms existing general-purpose optimization approaches both on multi-core CPUs and on GPGPUs.
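
    To make the memory-minimization cost model tangible, here is a hedged toy example (not the TCE's fusion algorithm) showing how fusing the producer and consumer loops of an intermediate tensor reduces its storage from quadratic to linear; the contraction expression and sizes are assumptions chosen purely for illustration.

```python
# Computes D[i] = sum_j (sum_k A[i,k] * B[k,j]) * v[j] in unfused and fused form.
import numpy as np

def unfused(A, B, v):
    """Materialise the full n x n intermediate C before the second contraction."""
    n = A.shape[0]
    C = np.empty((n, n))                       # O(n^2) intermediate storage
    for i in range(n):
        for j in range(n):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return np.array([sum(C[i, j] * v[j] for j in range(n)) for i in range(n)])

def fused(A, B, v):
    """Fuse over i so only one row of the intermediate is live: O(n) storage."""
    n = A.shape[0]
    D = np.zeros(n)
    for i in range(n):
        row = [sum(A[i, k] * B[k, j] for k in range(n)) for j in range(n)]
        D[i] = sum(row[j] * v[j] for j in range(n))
    return D

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, B, v = rng.random((8, 8)), rng.random((8, 8)), rng.random(8)
    print(np.allclose(unfused(A, B, v), fused(A, B, v)))  # True: same result, less memory
```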

    Model-driven search-based loop fusion optimization for handwritten code

    The Tensor Contraction Engine (TCE) is a compiler that translates high-level, mathematical tensor contraction expressions into efficient, parallel Fortran code. A pair of optimizations in the TCE, the fusion and tiling optimizations, have proven successful for minimizing disk-to-memory traffic for dense tensor computations. While other optimizations are specific to tensor contraction expressions, these two model-driven, search-based optimization algorithms could also be useful for optimizing handwritten dense array computations to minimize disk-to-memory traffic. In this thesis, we show how to apply the loop fusion algorithm to handwritten code in a procedural language. While in the TCE the loop fusion algorithm operated on high-level expression trees, in a standard compiler it needs to operate on abstract syntax trees. For simplicity, we use the fusion algorithm only for memory minimization rather than for minimizing disk-to-memory traffic. We also limit ourselves to handwritten, dense array computations in which loop bound expressions are constant, subscript expressions are simple loop variables, and there are no common subexpressions. After type checking, we canonicalize the abstract syntax tree to move side effects and loop-invariant code out of larger expressions. Using dataflow analysis, we then compute reaching definitions and add use-def chains to the abstract syntax tree. After undoing any partial loop fusion, a generalized loop fusion algorithm traverses the abstract syntax tree together with the use-def chains. Finally, the abstract syntax tree is rewritten to reflect the loop structure found by the loop fusion algorithm. We outline how the constraints on loop bound expressions and array index expressions could be removed in the future using an algebraic cost model and an analysis of the iteration space based on a polyhedral model.
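
    The following is a minimal sketch, under the same restrictions assumed above (constant loop bounds, subscripts that are plain loop variables), of a legality test for fusing two adjacent loops; the Loop/Stmt representation and the zero-distance dependence condition are illustrative assumptions rather than the generalized fusion algorithm described in the thesis.

```python
# Toy fusion-legality check on a simplified loop representation.
from dataclasses import dataclass, field

@dataclass
class Stmt:
    writes: str                                   # array written, e.g. "B"
    write_index: str                              # its subscript variable
    reads: dict = field(default_factory=dict)     # array name -> subscript variable

@dataclass
class Loop:
    var: str
    lower: int
    upper: int
    body: list                                    # list of Stmt

def can_fuse(first: Loop, second: Loop) -> bool:
    """Two adjacent loops are fusable here when they share bounds and every
    array the second loop reads from the first is accessed at the same
    iteration (dependence distance zero under the subscript restriction)."""
    if (first.lower, first.upper) != (second.lower, second.upper):
        return False
    written = {s.writes: s.write_index for s in first.body}
    for stmt in second.body:
        for array, idx in stmt.reads.items():
            if array in written:
                # the read must use the second loop's variable and the write the
                # first loop's, i.e. both refer to the current iteration
                if idx != second.var or written[array] != first.var:
                    return False
    return True

if __name__ == "__main__":
    produce = Loop("i", 0, 100, [Stmt("B", "i", {"A": "i"})])   # B[i] = f(A[i])
    consume = Loop("j", 0, 100, [Stmt("C", "j", {"B": "j"})])   # C[j] = g(B[j])
    print(can_fuse(produce, consume))                           # True
```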