19 research outputs found
Supporting nested parallelism
Many parallel applications do not fit completely into the data parallel model. Although these applications contain data parallelism, task parallelism is needed to represent the natural structure of the computation or to enhance performance. Combining the ease of programming of the data parallel model with the efficiency of the task parallel model requires allowing parallel forms to be nested, giving rise to nested parallelism.
In this work, we examine the solutions provided for nested parallelism in two standard parallel programming platforms, HPF and MPI. Both their expressive capacity and their efficiency are compared on a Cray T3E, a distributed memory machine. Finally, an additional discussion of the use of the methodology proposed for MPI is presented for two different architectures. I Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).
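The abstract does not reproduce the MPI methodology itself, but a common way to express nested parallelism in MPI is to split the global communicator into independent process groups, each of which then runs its own data-parallel phase. Below is a minimal sketch along those lines; the even/odd grouping and the reduction are illustrative choices, not taken from the paper.

```c
/* Hedged sketch: nested parallelism in MPI by splitting the communicator
 * into independent groups. Compile with an MPI C compiler, e.g. mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Outer (task) level: divide the processors into two groups. */
    int color = world_rank % 2;              /* illustrative grouping */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group);

    int group_rank, group_size;
    MPI_Comm_rank(group, &group_rank);
    MPI_Comm_size(group, &group_size);

    /* Inner (data-parallel) level: each group performs its own
     * collective reduction, independently of the other group. */
    double local = (double)group_rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, group);

    printf("world %d: group %d, rank %d/%d, group sum = %g\n",
           world_rank, color, group_rank, group_size, sum);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```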
Solving parallel problems by OTMP model
Since the early stages of parallel computing, one of the most common ways to introduce parallelism has been to extend a sequential language with some sort of parallel version of the for construct, commonly denoted the forall construct. Despite their similar syntax, these forall loops differ in their semantics and implementations. The High Performance Fortran (HPF) and OpenMP versions are likely among the most popular. This paper presents yet another forall loop extension for the C language. In this work, we introduce a parallel computation model: the One Thread Multiple Processor Model (OTMP). This model proposes an abstract machine, a programming model and a cost model. The programming model defines another forall loop construct, the theoretical machine targets both homogeneous shared and distributed memory computers, and the cost model allows the performance of a program to be predicted. OTMP not only integrates and extends sequential programming, but also includes and expands the message passing programming model. The model allows and exploits any number of nested levels of parallelism, taking advantage of situations where there are several small nested loops. Eje: Procesamiento distribuido y paralelo (PDP). Red de Universidades con Carreras en Informática (RedUNCI).
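OTMP's actual forall syntax is not shown in the abstract, so the sketch below uses OpenMP (one of the forall variants the abstract mentions) purely as a stand-in to illustrate nested forall-style loops over independent iterations; the array sizes and the two-level nesting are illustrative assumptions.

```c
/* Hedged sketch: nested forall-style parallelism expressed with OpenMP,
 * used here only as a stand-in for the forall construct discussed above. */
#include <omp.h>
#include <stdio.h>

#define N 8
#define M 4

int main(void)
{
    static double a[N][M];

    /* Allow more than one level of active parallelism. */
    omp_set_max_active_levels(2);

    #pragma omp parallel for            /* outer forall: rows */
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for        /* inner forall: elements of a row */
        for (int j = 0; j < M; j++) {
            a[i][j] = (double)(i * M + j);   /* independent iterations */
        }
    }

    printf("a[N-1][M-1] = %g\n", a[N - 1][M - 1]);
    return 0;
}
```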
Nested parallelism in parallel paradigms
Data parallelism is one of the most successful efforts to introduce explicit parallelism into high level programming languages. This approach is taken because many useful computations can be framed in terms of a set of independent sub-computations, each strongly associated with an element of a large data structure. Such computations are inherently parallelizable. Data parallel programming is particularly convenient for two reasons. The first is its ease of programming and the second is that it can scale easily to larger problem sizes. Several data parallel language implementations now exist. However, almost all discussions of data parallelism have been limited to the simplest and least expressive form: unstructured (flat) data parallelism.
Many other generalizations of the data parallel model have been proposed, which permit the nesting of data parallel constructs to specify parallel computation across nested and irregular data structures. These language implementations include the capability of nested parallel invocations, combining the ease of programming of the data parallel model with the efficiency of the task parallel model when executing on irregular data structures. Eje: Procesamiento Concurrente, paralelo y distribuido. Redes. Red de Universidades con Carreras en Informática (RedUNCI).
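As a concrete illustration of the contrast drawn above between flat and nested data parallelism, the following sketch processes a ragged (irregular) array in which every row has a different length, with parallelism both across rows and within each row. OpenMP tasks and simd reductions are used only as one possible realization; they are not the mechanism proposed in the text.

```c
/* Hedged sketch: nested data parallelism over an irregular (ragged)
 * structure. Each row has its own length; both the loop over rows and
 * the work inside each row are parallel. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int lengths[4] = {3, 7, 1, 5};           /* irregular row lengths */
    double sums[4] = {0.0, 0.0, 0.0, 0.0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 4; i++) {         /* outer level: one task per row */
            #pragma omp task firstprivate(i) shared(lengths, sums)
            {
                double s = 0.0;
                #pragma omp simd reduction(+:s)   /* inner data-parallel level */
                for (int j = 0; j < lengths[i]; j++)
                    s += (double)(i + j);
                sums[i] = s;
            }
        }
    }                                         /* implicit barrier: tasks finish */

    for (int i = 0; i < 4; i++)
        printf("row %d sum = %g\n", i, sums[i]);
    return 0;
}
```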
Analysis and tools for performance prediction
We present an analytical model that extends BSP to cover both oblivious synchronization and group partitioning. There are a few oversimplifications in BSP that make it difficult to obtain accurate predictions.
Even if the numbers of individual communication or computation operations in two stages are the same, the actual times for these two stages may differ. These differences are due to the different nature of the operations or to the particular pattern followed by the messages. Even worse, the assumption that a constant number of machine instructions takes constant time is far from the truth.
Current memory hierarchies imply that memory access times vary from a few cycles to several thousand. A natural proposal is to associate a different proportionality constant with each basic block and, analogously, to associate different latencies and bandwidths with each "communication block".
Unfortunately, using this approach implies that the evaluated parameters not only depend on the given architecture, but also reflect characteristics of the algorithm.
Such parameter evaluation must be done for every algorithm. This is a heavy task, involving experiment design, timing, statistics, pattern recognition and multi-parameter fitting algorithms, so software support is required. We have developed a compiler that takes as source a C program annotated with complexity formulas and produces as output an instrumented code. The trace files obtained from executing the resulting code are analyzed with an interactive interpreter, giving us, among other information, the values of those parameters. Eje: Programación concurrente. Red de Universidades con Carreras en Informática (RedUNCI).
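The authors' compiler and interpreter are not reproduced here, but the kind of multi-parameter fitting mentioned above can be illustrated with a minimal, self-contained sketch that recovers a per-block proportionality constant by fitting T ≈ a·n + b to (size, time) samples with ordinary least squares; the sample data are invented for the example.

```c
/* Hedged sketch: least-squares fit of T ~= a*n + b from timing samples,
 * illustrating the kind of per-block parameter fitting described above.
 * The sample data below are made up for the example. */
#include <stdio.h>

int main(void)
{
    /* (problem size, measured time) pairs, e.g. taken from a trace file */
    double n[] = {100, 200, 400, 800, 1600};
    double t[] = {0.011, 0.021, 0.042, 0.083, 0.165};
    int    k   = 5;

    double sn = 0, st = 0, snn = 0, snt = 0;
    for (int i = 0; i < k; i++) {
        sn  += n[i];
        st  += t[i];
        snn += n[i] * n[i];
        snt += n[i] * t[i];
    }

    /* Closed-form ordinary least squares for the line t = a*n + b. */
    double a = (k * snt - sn * st) / (k * snn - sn * sn);
    double b = (st - a * sn) / k;

    printf("fitted constants: a = %.3e s/op, b = %.3e s\n", a, b);
    return 0;
}
```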
Collective computing
The parallel computing model used in this paper, the Collective Computing Model (CCM), is a variant of the well-known Bulk Synchronous Parallel (BSP) model. The synchronicity imposed by the BSP model restricts the set of available algorithms and prevents the overlapping of computation and communication. Other models, like the LogP model, allow asynchronous computing and overlapping but depend on the use of specific libraries. The CCM describes a system exploited through a standard software platform providing facilities for group creation, collective operations and remote memory operations. Based on the BSP model, two kinds of supersteps are considered: division supersteps and normal supersteps. To illustrate these concepts, the Fast Fourier Transform algorithm is used. Computational results show the accuracy of the model on four different parallel computers: a Parsytec Power PC, a Cray T3E, a Silicon Graphics Origin 2000 and a Digital Alpha Server. Eje: Distribución y tiempo real. Red de Universidades con Carreras en Informática (RedUNCI).
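The abstract characterizes a normal superstep as local computation followed by communication through collective or remote memory operations. A minimal sketch of one such superstep, written with MPI one-sided operations (the neighbour choice and the data being exchanged are illustrative assumptions, not the CCM implementation):

```c
/* Hedged sketch: one "normal superstep" -- local computation followed by
 * remote memory operations -- written with MPI one-sided communication. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local, remote = 0.0;
    MPI_Win win;
    MPI_Win_create(&remote, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Computation phase of the superstep. */
    local = 2.0 * rank;

    /* Communication phase: put the result into the next processor's window. */
    int target = (rank + 1) % size;
    MPI_Win_fence(0, win);
    MPI_Put(&local, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);          /* superstep boundary */

    printf("rank %d received %g from rank %d\n",
           rank, remote, (rank + size - 1) % size);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```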
The collective computing model
The parallel computing model used in this paper, the Collective Computing Model (CCM), is a variant of the well-known Bulk Synchronous Parallel (BSP) model. The synchronicity imposed by the BSP model restricts the set of available algorithms and prevents the overlapping of computation and communication. Other models, like the LogP model, allow asynchronous computing and overlapping but depend on the use of specific libraries. The CCM describes a system exploited through a standard software platform providing facilities for group creation, collective operations and remote memory operations. Based on the BSP model, two kinds of supersteps are considered: Division supersteps and Normal supersteps. The structure of divisions produced by the Division Functions and the partnership relation among processors give rise to communication patterns among processors that are topologically similar to a hypercube. We have named the resulting structures Dynamic Polytopes. To illustrate these concepts, the Fast Fourier Transform algorithm is used. Computational results show the accuracy of the model on four different parallel computers: a Parsytec Power PC, a Cray T3E, a Silicon Graphics Origin 2000 and a Digital Alpha Server. Facultad de Informática.
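The hypercube-like partnership mentioned above can be made concrete with the butterfly pairing commonly used in parallel FFTs: at stage s, processor p is partnered with the processor whose rank differs from p only in bit s. The sketch below prints that pairing for eight processors; it shows only the partner computation, not the CCM implementation.

```c
/* Hedged sketch: hypercube/butterfly partnership among 2^d processors.
 * At stage s, processor p is paired with p XOR (1 << s). */
#include <stdio.h>

int main(void)
{
    int d = 3;                      /* log2 of the number of processors */
    int p_count = 1 << d;           /* 8 processors */

    for (int s = 0; s < d; s++) {
        printf("stage %d:", s);
        for (int p = 0; p < p_count; p++)
            printf(" %d<->%d", p, p ^ (1 << s));
        printf("\n");
    }
    return 0;
}
```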