Asynchronous Execution of Python Code on Task Based Runtime Systems
Despite advancements in the areas of parallel and distributed computing, the
complexity of programming on High Performance Computing (HPC) resources has
deterred many domain experts, especially in the areas of machine learning and
artificial intelligence (AI), from utilizing the performance benefits of such
systems. Researchers and scientists favor high-productivity languages to avoid
the inconvenience of programming in low-level languages and costs of acquiring
the necessary skills required for programming at this level. In recent years,
Python, with the support of linear algebra libraries like NumPy, has gained
popularity despite limitations that prevent such code from running in a
distributed setting. Here we present a solution that maintains both high-level
programming abstractions and parallel, distributed efficiency. Phylanx is an
asynchronous array processing toolkit which transforms Python and NumPy
operations into code which can be executed in parallel on HPC resources by
mapping Python and NumPy functions and variables into a dependency tree
executed by HPX, a general purpose, parallel, task-based runtime system written
in C++. Phylanx additionally provides introspection and visualization
capabilities for debugging and performance analysis. We have tested the
foundations of our approach by comparing our implementation of widely used
machine learning algorithms to their accepted NumPy implementations.
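The core idea of mapping array operations into a dependency tree that a runtime evaluates later can be illustrated with a minimal sketch. This is not the Phylanx API; the `LazyNode` class and `constant` helper below are hypothetical names used only to show how NumPy-style expressions can build a task graph instead of computing eagerly:

```python
# Minimal sketch (not the actual Phylanx API): array operations build a
# dependency tree, and nothing is computed until evaluate() is called.
import numpy as np

class LazyNode:
    """A node in the dependency tree: an operation plus the nodes it depends on."""
    def __init__(self, op, *deps):
        self.op = op      # callable producing this node's value
        self.deps = deps  # child nodes whose results feed this operation

    def __add__(self, other):
        return LazyNode(np.add, self, other)

    def __matmul__(self, other):
        return LazyNode(np.matmul, self, other)

    def evaluate(self):
        # A runtime like HPX could execute independent subtrees in parallel;
        # this sketch simply recurses sequentially.
        return self.op(*(d.evaluate() for d in self.deps))

def constant(value):
    return LazyNode(lambda: np.asarray(value))

a = constant([[1.0, 2.0], [3.0, 4.0]])
b = constant([[1.0, 0.0], [0.0, 1.0]])
expr = (a @ b) + a        # builds the tree; no arithmetic happens yet
result = expr.evaluate()  # the runtime walks the dependency tree
```

In a real task-based runtime, independent subtrees of such a graph become tasks that execute concurrently once their dependencies are satisfied.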
AllScale API
Effectively implementing scientific algorithms in distributed memory parallel applications is a difficult task for domain scientists, as evidenced by the large number of domain-specific languages and libraries available today attempting to facilitate the process. However, they usually provide a closed set of parallel patterns and are not open for extension without vast modifications to the underlying system. In this work, we present the AllScale API, a programming interface for developing distributed memory parallel applications with the ease of shared memory programming models. The AllScale API is closed for modification but open for extension, allowing new user-defined parallel patterns and data structures to be implemented based on existing core primitives and therefore fully supported in the AllScale framework. Focusing on high-level functionality directly offered to application developers, we present the advantages of such a design, detail some of its specifications, and evaluate it using three real-world use cases. Our results show that AllScale decreases the complexity of implementing scientific applications for distributed memory while attaining comparable or higher performance compared to MPI reference implementations.
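The "closed for modification, open for extension" principle can be sketched generically: a single core fork-join primitive stays untouched, and users define new parallel patterns on top of it. This is an illustrative analogy in Python, not AllScale's C++ API; the names `prec` and `pmap` are hypothetical:

```python
# Illustrative sketch of extension-by-composition: pmap is a user-defined
# parallel pattern built purely on the core recursive primitive `prec`,
# which itself is never modified.
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=8)

def prec(lo, hi, body, threshold=1):
    """Core primitive: recursively split [lo, hi) and process halves in parallel."""
    if hi - lo <= threshold:
        for i in range(lo, hi):
            body(i)
        return
    mid = (lo + hi) // 2
    left = _pool.submit(prec, lo, mid, body, threshold)  # fork left half
    prec(mid, hi, body, threshold)                       # process right half
    left.result()                                        # join

def pmap(func, data):
    """A new pattern expressed on top of prec, without touching prec."""
    out = [None] * len(data)
    prec(0, len(data), lambda i: out.__setitem__(i, func(data[i])))
    return out

squares = pmap(lambda x: x * x, [1, 2, 3, 4])
```

The design payoff claimed in the abstract is exactly this shape: because new patterns reduce to core primitives, the framework's scheduler and data management apply to them automatically.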
Parallel global three-dimensional electro-magnetic particle model simulation and its application to space weather problems
Parallel computing involves all of the challenges of serial programming, such as data partitioning, task partitioning, task scheduling, parallel debugging, and synchronization. The field of parallel computing is growing and becoming synonymous with mainstream computing topics such as parallel performance and architecture, theory and complexity analysis of parallel algorithms, design and scheduling of parallel tasks, and programming languages and systems for writing data-parallel and control-parallel programs. Besides emitting a continuous stream of plasma called the solar wind, the sun periodically releases billions of tons of matter in events called coronal mass ejections (CMEs). These immense clouds of material, when directed towards the earth, can cause large magnetic storms in the magnetosphere and the upper atmosphere. Magnetic storms produce many noticeable effects on and near the earth, often referred to as space weather problems. Energetic particle entry into the cusp and inner magnetosphere, and its kinetic characteristics, are also very important in space weather. In order to predict the space weather of the earth, many observation satellites, such as Cluster II and IMP 8, have been launched and numerous computer simulations are performed. Although global three-dimensional magnetohydrodynamic (3DMHD) simulations provide a quantitative picture, they cannot include kinetic effects, and hybrid simulations treat only the ions as particles while modeling the electrons as a fluid. In order to include the full self-consistent kinetic characteristics, our global three-dimensional electro-magnetic particle model (3DEMPM) simulations become an important tool for revealing the phenomena behind space weather problems.
In this thesis, based on the SPMD (Single Program Multiple Data) model, we construct a global three-dimensional electro-magnetic particle model (3DEMPM) using HPF (High Performance Fortran), an extended version of Fortran 90 for parallel systems that has become a standard parallel programming language in scientific computing. Thesis (Ph.D. in Engineering), University of Tsukuba, (A), no. 3683, 2005.3.25. Includes bibliographical references. At the author's request, only the abstract is made public.
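The kinetic core of an electromagnetic particle model advances each particle's velocity under the Lorentz force. A minimal sketch of the standard Boris rotation used in particle-in-cell codes (not the thesis's HPF code; the field values and step size below are illustrative):

```python
# Sketch of the Boris particle push, the standard velocity update in
# electromagnetic particle-in-cell simulation. All parameter values here
# are illustrative, not taken from the thesis.
import numpy as np

def boris_push(v, E, B, q, m, dt):
    """One Boris-scheme velocity update: half electric kick,
    magnetic rotation, second half electric kick."""
    qmdt2 = q * dt / (2.0 * m)
    v_minus = v + qmdt2 * E                   # first half electric kick
    t = qmdt2 * B                             # rotation vector
    s = 2.0 * t / (1.0 + np.dot(t, t))
    v_prime = v_minus + np.cross(v_minus, t)  # rotate around B
    v_plus = v_minus + np.cross(v_prime, s)
    return v_plus + qmdt2 * E                 # second half electric kick

# With E = 0 the update is a pure rotation, so kinetic energy is
# conserved exactly -- a key property for long space-weather runs.
v0 = np.array([1.0, 0.0, 0.0])
B = np.array([0.0, 0.0, 1.0])
v1 = boris_push(v0, np.zeros(3), B, q=1.0, m=1.0, dt=0.1)
```

In the SPMD setting the thesis describes, each process would run this same update over its own partition of the particle array.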
Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries
This work introduces a runtime model for managing communication with support
for latency-hiding. The model enables non-computer science researchers to
exploit communication latency-hiding techniques seamlessly. For compiled
languages, it is often possible to create efficient schedules for
communication, but this is not the case for interpreted languages. By
maintaining data dependencies between scheduled operations, it is possible to
aggressively initiate communication and lazily evaluate tasks to allow maximal
time for the communication to finish before entering a wait state. We implement
a heuristic of this model in DistNumPy, an auto-parallelizing version of
numerical Python that allows sequential NumPy programs to run on distributed
memory architectures. Furthermore, we present performance comparisons for eight
benchmarks with and without automatic latency-hiding. The results show that
our model reduces the time spent waiting for communication by as much as 27
times, from a maximum of 54% to only 2% of the total execution time, in a
stencil application.
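The latency-hiding pattern the abstract describes, initiating communication aggressively and deferring the wait until the data is actually needed, can be sketched as follows. This is not DistNumPy code; the network transfer is simulated with a background thread, and `fetch_halo` is a hypothetical stand-in for a non-blocking receive of boundary data:

```python
# Sketch of communication latency-hiding: start the "transfer" early,
# do local computation meanwhile, and wait only at the point of use.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fetch_halo():
    """Stand-in for a non-blocking receive of boundary (halo) data."""
    time.sleep(0.05)  # simulated network latency
    return np.ones(2)

grid = np.arange(10, dtype=float)

with ThreadPoolExecutor(max_workers=1) as pool:
    halo_future = pool.submit(fetch_halo)  # initiate communication eagerly
    interior = grid[1:-1] * 2.0            # compute on local data meanwhile
    halo = halo_future.result()            # enter the wait state only now
    result = interior.sum() + halo.sum()
```

The dependency tracking described in the abstract generalizes this: by recording which scheduled operations need which remote data, the runtime can reorder waits to maximize the overlap automatically.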
Learning from the Success of MPI
The Message Passing Interface (MPI) has been extremely successful as a
portable way to program high-performance parallel computers. This success has
occurred in spite of the view of many that message passing is difficult and
that other approaches, including automatic parallelization and directive-based
parallelism, are easier to use. This paper argues that MPI has succeeded
because it addresses all of the important issues in providing a parallel
programming model.
An Evaluation of the X10 Programming Language
As predicted by Moore's law, the number of transistors on a chip has doubled approximately every two years. For many years, the extra transistors massively benefited the whole computer industry, as they were used to increase CPU clock speed and thus boost performance. However, due to the heat wall and power constraints, clock speed cannot be increased limitlessly. Hardware vendors now have to take another path: using the transistors to increase the number of processor cores on each chip. This hardware change presents inevitable challenges to software, where single-threaded software will not benefit from newer chips and may even suffer from lower clock speeds. The two fundamental challenges are: 1. how to deal with the stagnation of single-core clock speed and cache memory, and 2. how to utilize the additional power of more cores on a chip. Most programming languages nowadays have distributed computing support, such as C and Java [1]. Meanwhile, some new programming languages were designed from scratch to take advantage of more distributed hardware structures; the X10 programming language is one of them. The goal of this project is to evaluate X10 in terms of performance, programmability, and tool support.
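X10's signature concurrency constructs are `async` (spawn an activity) and `finish` (wait for every activity spawned in a scope). A rough analogy of that structured fork-join model, written in Python rather than X10 (the `Finish` class and `async_` method below are illustrative names, not X10's API):

```python
# Rough Python analogy of X10's async/finish: every task spawned inside
# the `with` block is joined when the block exits.
from concurrent.futures import ThreadPoolExecutor

class Finish:
    """Analogue of an X10 finish block scoping a set of asyncs."""
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._futures = []

    def async_(self, fn, *args):
        # analogue of X10's `async { ... }`
        self._futures.append(self._pool.submit(fn, *args))

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        for f in self._futures:  # `finish` waits for every activity
            f.result()
        self._pool.shutdown()
        return False

results = [0] * 4
with Finish() as f:
    for i in range(4):
        f.async_(lambda i=i: results.__setitem__(i, i * i))
```

In X10 itself this guarantee is stronger: `finish` transitively waits on activities spawned by the spawned activities, which is part of what the evaluated programmability stems from.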