304 research outputs found
The paradigm compiler: Mapping a functional language for the connection machine
The Paradigm Compiler implements a new approach to compiling programs written in high level languages for execution on highly parallel computers. The general approach is to identify the principal data structures constructed by the program and to map these structures onto the processing elements of the target machine. The mapping is chosen to maximize performance as determined through compile time global analysis of the source program. The source language is Sisal, a functional language designed for scientific computations, and the target language is Paris, the published low level interface to the Connection Machine. The data structures considered are multidimensional arrays whose dimensions are known at compile time. Computations that build such arrays usually offer opportunities for highly parallel execution; they are data parallel. The Connection Machine is an attractive target for these computations, and the parallel for construct of the Sisal language is a convenient high level notation for data parallel algorithms. The principles and organization of the Paradigm Compiler are discussed
Recommended from our members
Exploiting iteration-level parallelism in dataflow programs
The term "dataflow" generally encompasses three distinct aspects of computation - a data-driven model of computation, a functional/declarative programming language, and a special-purpose multiprocessor architecture. In this paper we decouple the language and architecture issues by demonstrating that declarative programming is a suitable vehicle for the programming of conventional distributed-memory multiprocessors.This is achieved by appling several transformations to the compiled declarative program to achieve iteration-level (rather than instruction-level) parallelism. The transformations first group individual instructions into sequential light-weight processes, and then insert primitives to: (1) cause array allocation to be distributed over multiple processors, (2) cause computation to follow the data distribution by inserting an index filtering mechanism into a given loop and spawning a copy of it on all PEs; the filter causes each instance of that loop to operate on a different subrange of the index variable.The underlying model of computation is a dataflow/von Neumann hybrid in that exection within a process is control-driven while the creation, blocking, and activation of processes is data-driven.The performance of this process-oriented dataflow system (PODS) is demonstrated using the hydrodynamics simulation benchmark called SIMPLE, where a 19-fold speedup on a 32-processor architecture has been achieved
Recommended from our members
Executing matrix multiply on a process oriented data flow machine
The Process-Oriented Dataflow System (PODS) is an execution model that combines the von Neumann and dataflow models of computation to gain the benefits of each. Central to PODS is the concept of array distribution and its effects on partitioning and mapping of processes.In PODS arrays are partitioned by simply assigning consecutive elements to each processing element (PE) equally. Since PODS uses single assignment, there will be only one producer of each element. This producing PE owns that element and will perform the necessary computations to assign it. Using this approach the filling loop is distributed across the PEs. This simple partitioning and mapping scheme provides excellent results for executing scientific code on MIMD machines. In this way PODS allows MIMD machines to exploit vector and data parallelism easily while still providing the flexibility of MIMD over SIMD for multi-user systems.In this paper, the classic matrix multiply algorithm, with 1024 data points, is executed on a PODS simulator and the results are presented and discussed. Matrix multiply is a good example because it has several interesting properties: there are multiple code-blocks; a new array must be dynamically allocated and distributed; there is a loop-carried dependency in the innermost loop; the two input arrays have different access patterns; and the sizes of the input arrays are not known at compile time. Matrix multiply also forms the basis for many important scientific algorithms such as: LU decomposition, convolution, and the Fast-Fourier Transform.The results show that PODS is comparable to both Iannucci's Hybrid Architecture and MIT's TTDA in terms of overhead and instruction power. They also show that PODS easily distributes the work load evenly across the PEs. The key result is that PODS can scale matrix multiply in a near linear fashion until there is little or no work to be performed for each PE. Then overhead and message passing become a major component of the execution time. With larger problems (e.g., >/=16k data points) this limit would be reached at around 256 PEs
Recommended from our members
Computer-aided programming for multiprocessing systems
As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This report discusses parallel models of computation and tools for computer-aided programming (CAP). Program development tools are necessary since programmers are not able to develop complex parallel programs efficiently. In particular, a CAP tool, named Hypertool, is described here. It performs scheduling and handles the communication primitive insertion automatically so that many errors are eliminated. It also generates the performance estimates and other program quality measures to help programmers in improving their algorithms and programs. Experiments have shown that up to a 300% performance improvement can be achieved by computer-aided programming
Recommended from our members
Program allocation for hypercube based dataflow systems
The dataflow model of computation differs from the traditional control-flow
model of computation in that it does not utilize a program counter to sequence
instructions in a program. Instead, the execution of instructions is based solely on the
availability of their operands. Thus, an instruction is executed in a dataflow computer
when all of its operands are available. This asynchronous nature of the dataflow model of
computation allows the exploitation of fine-grain parallelism inherent in programs.
Although the dataflow model of computation exploits parallelism, the problem of
optimally allocating a program to processors belongs to the class of NP-complete
problems. Therefore, one of the major issues facing designers of dataflow
multiprocessors is the proper allocation of programs to processors.
The problem of program allocation lies in maximizing parallelism while
minimizing interprocessor communication costs. The culmination of research in the area
of program allocation has produced the proposed method called the Balanced Layered
Allocation Scheme that utilizes heuristic rules to strike a balance between computation
time and communication costs in dataflow multiprocessors. Specifically, the proposed
allocation scheme utilizes Critical Path and Longest Directed Path heuristics when
allocating instructions to processors. Simulation studies indicate that the proposed
scheme is effective in reducing the overall execution time of a program by considering
the effects of communication costs on computation times
An intelligent allocation algorithm for parallel processing
The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of interprocessor communication network, influence the allocation. To achieve realistic estimations of the executive durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration, varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network
Dataflow development of medium-grained parallel software
PhD ThesisIn the 1980s, multiple-processor computers (multiprocessors) based on conven-
tional processing elements emerged as a popular solution to the continuing demand
for ever-greater computing power. These machines offer a general-purpose parallel
processing platform on which the size of program units which can be efficiently
executed in parallel - the "grain size" - is smaller than that offered by distributed
computing environments, though greater than that of some more specialised
architectures. However, programming to exploit this medium-grained parallelism
remains difficult. Concurrent execution is inherently complex, yet there is a lack of
programming tools to support parallel programming activities such as program
design, implementation, debugging, performance tuning and so on.
In helping to manage complexity in sequential programming, visual tools have
often been used to great effect, which suggests one approach towards the goal of
making parallel programming less difficult.
This thesis examines the possibilities which the dataflow paradigm has to offer
as the basis for a set of visual parallel programming tools, and presents a dataflow
notation designed as a framework for medium-grained parallel programming. The
implementation of this notation as a programming language is discussed, and its
suitability for the medium-grained level is examinedScience and Engineering Research Council of Great Britain
EC ERASMUS schem
- …