18,634 research outputs found
Learning Parallel Computations with ParaLab
In this paper, we present the ParaLab teachware system, which can be used for learning the parallel computation methods. ParaLab provides the tools for simulating the multiprocessor computational systems with various network topologies, for carrying out the computational experiments in the simulation mode, and for evaluating the efficiency of the parallel computation methods. The visual presentation of the parallel computations taking place in the computational experiments is the key feature of the system. ParaLab can be used for the laboratory training within various teaching courses in the field of parallel, distributed, and supercomputer computations
Parallel software tools at Langley Research Center
This document gives a brief overview of parallel software tools available on the Intel iPSC/860 parallel computer at Langley Research Center. It is intended to provide a source of information that is somewhat more concise than vendor-supplied material on the purpose and use of various tools. Each of the chapters on tools is organized in a similar manner covering an overview of the functionality, access information, how to effectively use the tool, observations about the tool and how it compares to similar software, known problems or shortfalls with the software, and reference documentation. It is primarily intended for users of the iPSC/860 at Langley Research Center and is appropriate for both the experienced and novice user
Recommended from our members
Executing matrix multiply on a process oriented data flow machine
The Process-Oriented Dataflow System (PODS) is an execution model that combines the von Neumann and dataflow models of computation to gain the benefits of each. Central to PODS is the concept of array distribution and its effects on partitioning and mapping of processes.In PODS arrays are partitioned by simply assigning consecutive elements to each processing element (PE) equally. Since PODS uses single assignment, there will be only one producer of each element. This producing PE owns that element and will perform the necessary computations to assign it. Using this approach the filling loop is distributed across the PEs. This simple partitioning and mapping scheme provides excellent results for executing scientific code on MIMD machines. In this way PODS allows MIMD machines to exploit vector and data parallelism easily while still providing the flexibility of MIMD over SIMD for multi-user systems.In this paper, the classic matrix multiply algorithm, with 1024 data points, is executed on a PODS simulator and the results are presented and discussed. Matrix multiply is a good example because it has several interesting properties: there are multiple code-blocks; a new array must be dynamically allocated and distributed; there is a loop-carried dependency in the innermost loop; the two input arrays have different access patterns; and the sizes of the input arrays are not known at compile time. Matrix multiply also forms the basis for many important scientific algorithms such as: LU decomposition, convolution, and the Fast-Fourier Transform.The results show that PODS is comparable to both Iannucci's Hybrid Architecture and MIT's TTDA in terms of overhead and instruction power. They also show that PODS easily distributes the work load evenly across the PEs. The key result is that PODS can scale matrix multiply in a near linear fashion until there is little or no work to be performed for each PE. Then overhead and message passing become a major component of the execution time. With larger problems (e.g., >/=16k data points) this limit would be reached at around 256 PEs
AutoParallel: A Python module for automatic parallelization and distributed execution of affine loop nests
The last improvements in programming languages, programming models, and
frameworks have focused on abstracting the users from many programming issues.
Among others, recent programming frameworks include simpler syntax, automatic
memory management and garbage collection, which simplifies code re-usage
through library packages, and easily configurable tools for deployment. For
instance, Python has risen to the top of the list of the programming languages
due to the simplicity of its syntax, while still achieving a good performance
even being an interpreted language. Moreover, the community has helped to
develop a large number of libraries and modules, tuning them to obtain great
performance.
However, there is still room for improvement when preventing users from
dealing directly with distributed and parallel computing issues. This paper
proposes and evaluates AutoParallel, a Python module to automatically find an
appropriate task-based parallelization of affine loop nests to execute them in
parallel in a distributed computing infrastructure. This parallelization can
also include the building of data blocks to increase task granularity in order
to achieve a good execution performance. Moreover, AutoParallel is based on
sequential programming and only contains a small annotation in the form of a
Python decorator so that anyone with little programming skills can scale up an
application to hundreds of cores.Comment: Accepted to the 8th Workshop on Python for High-Performance and
Scientific Computing (PyHPC 2018
Mainstream parallel array programming on cell
We present the E] compiler and runtime library for the ‘F’ subset of
the Fortran 95 programming language. ‘F’ provides first-class support for arrays,
allowing E] to implicitly evaluate array expressions in parallel using the SPU coprocessors
of the Cell Broadband Engine. We present performance results from
four benchmarks that all demonstrate absolute speedups over equivalent ‘C’ or
Fortran versions running on the PPU host processor. A significant benefit of this
straightforward approach is that a serial implementation of any code is always
available, providing code longevity, and a familiar development paradigm
Parallelization of irregularly coupled regular meshes
Regular meshes are frequently used for modeling physical phenomena on both serial and parallel computers. One advantage of regular meshes is that efficient discretization schemes can be implemented in a straight forward manner. However, geometrically-complex objects, such as aircraft, cannot be easily described using a single regular mesh. Multiple interacting regular meshes are frequently used to describe complex geometries. Each mesh models a subregion of the physical domain. The meshes, or subdomains, can be processed in parallel, with periodic updates carried out to move information between the coupled meshes. In many cases, there are a relatively small number (one to a few dozen) subdomains, so that each subdomain may also be partitioned among several processors. We outline a composite run-time/compile-time approach for supporting these problems efficiently on distributed-memory machines. These methods are described in the context of a multiblock fluid dynamics problem developed at LaRC
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods
Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors
Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programmin
- …