255 research outputs found

    Scalable Fast Multipole Methods on Heterogeneous Architecture

    Get PDF
    The N-body problem appears in many computational physics simulations. At each time step the computation involves an all-pairs sum whose complexity is quadratic, followed by an update of particle positions. This cost means that it is not practical to solve such dynamic N-body problems on large scale. To improve this situation, we use both algorithmic and hardware approaches. Our algorithmic approach is to use the Fast Multipole Method (FMM), which is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop, to reduce such quadratic complexity to linear with guaranteed accuracy. Our hardware approach is to use heterogeneous clusters, which comprised of nodes that contain multi-core CPUs tightly coupled with accelerators, such as graphics processors unit (GPU) as our underline parallel processing hardware, on which efficient implementations require highly non-trivial re-designed algorithms. In this dissertation, we fundamentally reconsider the FMM algorithms on heterogeneous architectures to achieve a significant improvement over recent/previous implementations in literature and to make the algorithm ready for use as a workhorse simulation tool for both time-dependent vortex flow problems and for boundary element methods. Our major contributions include: 1. Novel FMM data structures using parallel construction algorithms for dynamic problems. 2. A fast hetegenenous FMM algorithm for both single and multiple computing nodes. 3. An efficient inter-node communication management using fast parallel data structures. 4. A scalable FMM algorithm using novel Helmholz decomposition for Vortex Methods (VM). The proposed algorithms can handle non-uniform distributions with irregular partition shapes to achieve workload balance and their MPI-CUDA implementations are highly tuned up and demonstrate the state of the art performances

    Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

    Full text link
    MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. Casper is a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed using a high-level intermediate language resembling the MapReduce paradigm and verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2x faster compared to the original.Comment: 12 pages, additional 4 pages of references and appendi

    Irregular Coarse-Grain Data Parallelism under LPARX

    Get PDF

    Static Analysis-based Debugging, Certification, Testing, and Optimization with CiaoPP

    Get PDF
    Facilitate the development of safe, efficient programs. Approach: •Next-generation, higher-level, multiparadigm prog. languages. •Improved program development environments. •A framework (CiaoPP) which integrates: •Debugging. •Verification and certification. •Testing. •Optimization (optimized compilation, parallelization, ...

    Graphical Reasoning in Compact Closed Categories for Quantum Computation

    Full text link
    Compact closed categories provide a foundational formalism for a variety of important domains, including quantum computation. These categories have a natural visualisation as a form of graphs. We present a formalism for equational reasoning about such graphs and develop this into a generic proof system with a fixed logical kernel for equational reasoning about compact closed categories. Automating this reasoning process is motivated by the slow and error prone nature of manual graph manipulation. A salient feature of our system is that it provides a formal and declarative account of derived results that can include `ellipses'-style notation. We illustrate the framework by instantiating it for a graphical language of quantum computation and show how this can be used to perform symbolic computation.Comment: 21 pages, 9 figures. This is the journal version of the paper published at AIS

    Scalable parallel molecular dynamics algorithms for organic systems

    Get PDF
    A scalable parallel algorithm, Macro-Molecular Dynamics (MMD), has been developed for large-scale molecular dynamics simulations of organic macromolecules, based on space-time multi-resolution techniques and dynamic management of distributed lists. The algorithm also includes the calculation of long range forces using Fast Multipole Method (FMM). FMM is based on the octree data structure, in which each parent cell is divided into 8 child cells and this division continues until the cell size is equal to the non-bonded interaction cutoff length. Due to constant number of operations performed at each stage of the octree, the FMM algorithm scales as O(N). Design and analysis of MMD and FMM algorithms are presented. Scalability tests are performed on three tera-flop machines: 1024-processor Intel Xeon-based Linux cluster, SuperMike at LSU, 1184-processor IBM SP4 Marcellus and the 512-processor Compaq AlphaServer Emerald at the U.S. Army Engineer Research and Development Center (ERDC) MSRC. The tests show that the Linux cluster outperforms the SP4 for the MMD application. The tests also show significant effects of memory- and cache-sharing on the performance

    Verified lifting of stencil computations

    Get PDF
    This paper demonstrates a novel combination of program synthesis and verification to lift stencil computations from low-level Fortran code to a high-level summary expressed using a predicate language. The technique is sound and mostly automated, and leverages counter-example guided inductive synthesis (CEGIS) to find provably correct translations. Lifting existing code to a high-performance description language has a number of benefits, including maintainability and performance portability. For example, our experiments show that the lifted summaries can enable domain specific compilers to do a better job of parallelization as compared to an off-the-shelf compiler working on the original code, and can even support fully automatic migration to hardware accelerators such as GPUs. We have implemented verified lifting in a system called STNG and have evaluated it using microbenchmarks, mini-apps, and real-world applications. We demonstrate the benefits of verified lifting by first automatically summarizing Fortran source code into a high-level predicate language, and subsequently translating the lifted summaries into Halide, with the translated code achieving median performance speedups of 4.1X and up to 24X for non-trivial stencils as compared to the original implementation.United States. Department of Energy. Office of Science (Award DE-SC0008923)United States. Department of Energy. Office of Science (Award DE-SC0005288

    Composable architecture for rack scale big data computing

    No full text
    The rapid growth of cloud computing, both in terms of the spectrum and volume of cloud workloads, necessitate re-visiting the traditional rack-mountable servers based datacenter design. Next generation datacenters need to offer enhanced support for: (i) fast changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, disks etc., and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrated the feasibility of composable systems through building a rack scale composable system prototype using PCIe switch. Through empirical approaches, we develop assessment of the opportunities and challenges for leveraging the composable architecture for rack scale cloud datacenters with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and developed the implications for the network and resource provisioning and management for rack scale architecture
    corecore