
    Asynchronous Execution of Python Code on Task Based Runtime Systems

    Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing the performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and the cost of acquiring the skills required for programming at that level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite limitations that prevent such code from running on distributed systems. Here we present a solution which maintains high-level programming abstractions as well as parallel and distributed efficiency. Phylanx is an asynchronous array processing toolkit which transforms Python and NumPy operations into code that can be executed in parallel on HPC resources, by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general-purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementations of widely used machine learning algorithms to accepted NumPy standards.
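    To make the workflow concrete, here is a minimal sketch of how NumPy-style code might be handed to such a toolkit through a decorator-style front end. The module and decorator names below are illustrative assumptions rather than the confirmed Phylanx API; the point is only that unmodified NumPy source is turned into a task dependency tree for the runtime to schedule.

        # Illustrative sketch; the phylanx import path and decorator are assumptions.
        import numpy as np
        from phylanx import Phylanx  # hypothetical import path

        @Phylanx  # the function body is translated into a dependency tree
        def lra(x, y, alpha, iterations):
            """Logistic-regression-style training loop in plain NumPy."""
            weights = np.zeros(x.shape[1])
            for _ in range(iterations):
                pred = 1.0 / (1.0 + np.exp(-np.dot(x, weights)))
                weights = weights - alpha * np.dot(x.T, pred - y)
            return weights

        # Called like an ordinary Python function; the runtime (HPX) schedules
        # the generated tasks in parallel on the available resources.
        # w = lra(train_x, train_y, 1e-2, 100)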

    AllScale API

    Effectively implementing scientific algorithms in distributed memory parallel applications is a difficult task for domain scientists, as evidenced by the large number of domain-specific languages and libraries available today that attempt to facilitate the process. However, they usually provide a closed set of parallel patterns and are not open for extension without vast modifications to the underlying system. In this work, we present the AllScale API, a programming interface for developing distributed memory parallel applications with the ease of shared memory programming models. The AllScale API is closed for modification but open for extension, allowing new user-defined parallel patterns and data structures to be implemented based on existing core primitives and therefore fully supported in the AllScale framework. Focusing on high-level functionality directly offered to application developers, we present the advantages of such an API design, detail some of its specifications, and evaluate it using three real-world use cases. Our results show that AllScale decreases the complexity of implementing scientific applications for distributed memory while attaining comparable or higher performance than MPI reference implementations.
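    The "open for extension" idea is that user-facing patterns are compositions of a small set of core primitives. The Python sketch below illustrates that design idea only; it is not the actual AllScale C++ API, and all names in it are invented. A generic divide-and-conquer primitive stands in for such a core operator, and a parallel-for pattern is defined on top of it.

        # Conceptual illustration: building a user-level pattern (pfor)
        # on top of a single core recursive primitive (rec).
        import threading

        def rec(problem, is_small, solve_small, split, combine):
            """Core primitive: split a problem and solve the halves in parallel."""
            if is_small(problem):
                return solve_small(problem)
            left, right = split(problem)
            result = {}
            worker = threading.Thread(
                target=lambda: result.update(
                    r=rec(right, is_small, solve_small, split, combine)))
            worker.start()
            left_result = rec(left, is_small, solve_small, split, combine)
            worker.join()
            return combine(left_result, result["r"])

        def pfor(lo, hi, body, grain=1024):
            """User-defined parallel-for pattern expressed via the core primitive."""
            mid = lambda r: (r[0] + r[1]) // 2
            rec((lo, hi),
                is_small=lambda r: r[1] - r[0] <= grain,
                solve_small=lambda r: [body(i) for i in range(r[0], r[1])],
                split=lambda r: ((r[0], mid(r)), (mid(r), r[1])),
                combine=lambda a, b: None)

        # Example: pfor(0, 10_000, lambda i: some_work(i))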

    Parallel global three-dimensional electro-magnetic particle model simulation and its application to space weather problems

    Parallel computing involves all of the challenges of serial programming, as well as data partitioning, task partitioning, task scheduling, parallel debugging, and synchronization. The field of parallel computing is growing and increasingly overlaps with mainstream computing topics such as parallel performance and architecture, theory and complexity analysis of parallel algorithms, design and scheduling of parallel tasks, and programming languages and systems for writing data-parallel and control-parallel programs. Besides emitting a continuous stream of plasma called the solar wind, the sun periodically releases billions of tons of matter in events called coronal mass ejections (CMEs). These immense clouds of material, when directed towards the earth, can cause large magnetic storms in the magnetosphere and the upper atmosphere. Magnetic storms produce many noticeable effects on and near the earth, which are often referred to as space weather problems. Energetic particle entry into the cusp and inner magnetosphere, and its kinetic characteristics, are also very important in space weather. In order to predict the space weather of the earth, many observation satellites, such as Cluster II and IMP 8, have been launched, and numerous computer simulations are performed. Although global three-dimensional magnetohydrodynamic (3D MHD) simulations provide a quantitative picture, they cannot include kinetic effects. Hybrid simulations likewise treat only ions as particles and electrons as a fluid. In order to include the full self-consistent kinetic characteristics, our global three-dimensional electromagnetic particle model (3DEMPM) simulations become an important tool for revealing the phenomena behind space weather problems. In this thesis, based on the SPMD (Single Program, Multiple Data) model, we construct a global three-dimensional electromagnetic particle model (3DEMPM) using HPF (High Performance Fortran), an extension of Fortran 90 for parallel systems that has become a standard parallel programming language in scientific computing. … Thesis (Ph.D. in Engineering)--University of Tsukuba, (A), no. 3683, 2005.3.25. Includes bibliographical references. At the author's request, only the abstract is publicly available.
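    As a schematic illustration of the kind of kinetic update at the heart of an electromagnetic particle model (and not the thesis's actual HPF code), the sketch below pushes particles under the Lorentz force with a simple explicit step; the array shapes and the integration scheme are assumptions made for brevity.

        # Schematic particle push: advance velocities and positions by one step.
        import numpy as np

        def push_particles(x, v, E, B, q, m, dt):
            """x, v, E, B are (N, 3) arrays of positions, velocities and fields."""
            # Lorentz acceleration: a = (q/m) * (E + v x B)
            a = (q / m) * (E + np.cross(v, B))
            v_new = v + a * dt        # explicit velocity update
            x_new = x + v_new * dt    # position update using the new velocity
            return x_new, v_new

        # In an SPMD decomposition, each process applies this update to its own
        # subset of particles and exchanges boundary/field data with neighbors.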

    Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

    This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer-science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks, allowing maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent waiting for communication by as much as a factor of 27, from a maximum of 54% to only 2% of the total execution time, in a stencil application.
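    The following Python sketch illustrates the scheduling idea described above, i.e. recording data dependencies so that non-blocking communication can be started as early as possible and waited on as late as possible. It is a conceptual illustration of our own, not DistNumPy's actual implementation or API.

        # Conceptual sketch of dependency-tracked lazy evaluation for latency-hiding.
        class LazyOp:
            def __init__(self, compute, deps=()):
                self.compute = compute      # deferred computation
                self.deps = deps            # operations this one depends on
                self.request = None         # handle for in-flight communication
                self.value = None
                self.done = False

            def start_comm(self, initiate):
                """Eagerly start non-blocking communication (e.g. an isend/irecv)."""
                self.request = initiate()

            def force(self):
                """Evaluate lazily: resolve dependencies first, block on comm last."""
                if self.done:
                    return self.value
                for dep in self.deps:
                    dep.force()
                if self.request is not None:
                    self.request.wait()     # latest possible point to wait
                self.value = self.compute()
                self.done = True
                return self.value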

    Learning from the Success of MPI

    The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model.
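    For readers unfamiliar with the model under discussion, the fragment below shows what explicit message passing looks like in practice; it uses the mpi4py Python binding purely as an illustration of our own choosing, not an example taken from the paper.

        # Minimal point-to-point message passing with mpi4py (run with mpiexec -n 2).
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        if rank == 0:
            data = {"payload": list(range(10))}
            comm.send(data, dest=1, tag=11)      # explicit send to rank 1
        elif rank == 1:
            data = comm.recv(source=0, tag=11)   # matching receive from rank 0
            print("rank 1 received", data["payload"])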

    An Evaluation of the X10 Programming Language

    As predicted by Moore's law, the number of transistors on a chip has doubled approximately every two years. For many years, these extra transistors massively benefited the whole computer industry: they were used to increase CPU clock speed, thus boosting performance. However, due to the heat wall and power constraints, clock speed cannot be increased limitlessly. Hardware vendors now have to take another path, using the transistors to increase the number of processor cores on each chip. This change in hardware structure presents inevitable challenges to software, as single-threaded software will not benefit from newer chips and may even suffer from lower clock speeds. The two fundamental challenges are: (1) how to deal with the stagnation of single-core clock speed and cache memory, and (2) how to utilize the additional power provided by more cores on a chip. Most programming languages nowadays have distributed computing support, such as C and Java [1]. Meanwhile, some new programming languages were designed from scratch specifically to take advantage of these more distributed hardware structures. The X10 Programming Language is one of them. The goal of this project is to evaluate X10 in terms of performance, programmability, and tool support.