111 research outputs found

Piece-wise scheduling of composite task graphs onto distributed memory parallel computers

Author: Lewis Ted G.
Publication venue: Oregon State University. Department of Computer Science

Heuristics for static scheduling of task graphs using list scheduling techniques have continued to improve by adding real-world factors such as processor speed, network transmission speed, interconnection topology, and link contention considerations to the basic task graph model. Yet, the resulting schedules do not fully model program loops and branches, startup costs for both process creation and message initiation, and a number of interesting parallel processing patterns such as meshes, trees, and supervisor/workers. In fact, improvements in the schedule may be obtained when the task graph is regular, as when it contains repeated or replicated tasks, divide-and-conquer patterns of communication, or a mesh-structured pattern of computation. In this paper we describe a limited approach to scheduling composite task graphs that considers process and message startup costs, and three regular patterns: replicated, tree, and mesh. The approach is to model programs with such regular patterns as a composite task graph, where each regular structure is a decomposable sub-task node in the task graph. Then, we compute an optimal schedule for each sub-task graph, piece the sub-tasks together, and perform an ordinary static scheduling heuristic on the pieces to produce an overall schedule. We define a composite task graph as a hierarchical task graph containing regular-structured sub-task graphs as components. At the top level of this hierarchy, each graph node represents either a simple task or a hierarchically decomposable sub-task graph. We propose a piece-wise scheduling algorithm that simply allocates processors to sub-task graphs according to closed-form expressions that determine the optimal number of processors, and then uses a list scheduling algorithm to schedule the flattened graph onto these processors.
We do not address the pressing problem of loops and branches in the task graph representation, but we speculate that the technique of piece-wise scheduling introduced here can be adapted to a hybrid form of scheduling that may accommodate branches and loops. Piece-wise scheduling is not guaranteed to yield the best global schedule. Rather, it pieces together locally optimum sub-schedules. Finding globally optimum schedules for composite task graphs remains an open problem. We present a heuristic approach that has been experimentally used to schedule small parallel programs with encouraging results. More empirical evidence is needed to determine the usefulness of this technique, but early indications are encouraging.
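
The last step named in this abstract, list scheduling of a flattened task graph onto a fixed set of processors, can be sketched as follows. This is a minimal illustration and not the authors' algorithm: it assumes a bottom-level priority rule, ignores communication and startup costs, and takes the processor count as an input rather than from the closed-form expressions the abstract mentions. The names `bottom_levels`, `list_schedule`, and `makespan` are hypothetical.

```python
def bottom_levels(durations, succs):
    """Longest path length (own duration included) from each task to an exit task."""
    memo = {}

    def level(task):
        if task not in memo:
            below = max((level(s) for s in succs.get(task, [])), default=0)
            memo[task] = durations[task] + below
        return memo[task]

    for task in durations:
        level(task)
    return memo


def list_schedule(durations, succs, num_procs):
    """Greedy list scheduling; returns {task: (processor, start, finish)}."""
    # Build predecessor sets from the successor lists.
    preds = {task: set() for task in durations}
    for task, children in succs.items():
        for child in children:
            preds[child].add(task)

    priority = bottom_levels(durations, succs)
    proc_free = [0] * num_procs   # time at which each processor becomes idle
    finish = {}                   # finish time of every scheduled task
    schedule = {}
    remaining = set(durations)

    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in remaining if all(p in finish for p in preds[t])]
        # Highest bottom level first; ties broken by name for determinism.
        task = min(ready, key=lambda t: (-priority[t], t))
        data_ready = max((finish[p] for p in preds[task]), default=0)
        # Pick the processor that allows the earliest start.
        proc = min(range(num_procs), key=lambda i: max(proc_free[i], data_ready))
        start = max(proc_free[proc], data_ready)
        finish[task] = start + durations[task]
        proc_free[proc] = finish[task]
        schedule[task] = (proc, start, finish[task])
        remaining.remove(task)
    return schedule


def makespan(schedule):
    """Completion time of the whole schedule."""
    return max(end for _, _, end in schedule.values())


# A small replicated (fork-join) pattern: a -> {b, c, d} -> e.
durations = {"a": 1, "b": 2, "c": 2, "d": 2, "e": 1}
succs = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"]}
```

On this fork-join example, calling `makespan(list_schedule(durations, succs, p))` for different values of `p` shows how the schedule length depends on the number of processors allocated to the replicated part.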

ScholarsArchive@OSU

Structured Parallel Programming and Cache Coherence in Multicore Architectures

Author: Lametti Silvia
Publication venue: Pisa University Press
Publication date: 09/12/2015

It is clear that multicore processors have become the building blocks of today’s high-performance computing platforms. The advent of massively parallel single-chip microprocessors further emphasizes the gap that exists between parallel architectures and parallel programming maturity. Our research group, starting from its experience with distributed and shared memory multiprocessors, was one of the first to propose a Structured Parallel Programming approach to bridge this gap. In this scenario, one of the biggest problems is that an application’s performance is often affected by the sharing pattern of data and its impact on Cache Coherence. Currently, multicore platforms rely on hardware or automatic cache coherence techniques that allow programmers to develop programs without taking the problem into account. It is well known that standard coherency protocols are inefficient for certain data communication patterns, and these inefficiencies will be amplified by the increased core number and the complex memory hierarchies. Following a structured parallelism approach, our methodology to attack these problems is based on two interrelated issues: structured parallelism paradigms and cost models (or performance models). Evaluating the performance of a program, although widely studied, is still an open problem in the research community and, notably, specific cost models to describe multicores are missing. For this reason, in this thesis we define an abstract model for cache coherent architectures, which is able to capture the essential elements and the qualitative behaviors of multicore-based systems. Furthermore, we show how this abstract model, combined with well-known performance modelling techniques such as analytical modelling (e.g., queueing models and stochastic process algebras) or simulations, provides an application- and architecture-dependent cost model to predict the performance of structured parallel applications.
Starting out from the behavior and performance predictability of structured parallelism schemes, in this thesis we address the issue of cache coherence in multicore architectures, following an algorithm-dependent approach, a particular kind of software cache coherence solution characterized by explicit cache management strategies, which are specific to the algorithm to be executed. Notably, we ensure parallel correctness by exploiting architecture-specific mechanisms and by defining proper data structures in order to “emulate” cache coherence solutions in an efficient way for each computation. Algorithm-dependent cache coherence can be efficiently implemented at the support level of structured parallelism paradigms, with absolute transparency with respect to the application programmer. Moreover, by using the cost model, in this thesis we study and compare different algorithm-dependent implementations, setting those based on automatic cache coherence against an original, non-automatic, lock-free solution based on interprocessor communications. Notably, with this latter implementation, in some cases, we are able to reduce the number of memory accesses, cache transfers and synchronizations and to increase computation parallelism with respect to the use of automatic cache coherence. Current architectures do not usually allow disabling automatic cache coherence. However, the emergence of many-core architectures has somewhat changed the scenario, so that some architectures, such as the Tilera TilePro64, allow the automatic cache coherence facilities to be controlled and disabled. For this reason, in this thesis we finally apply our methodology to the TilePro64 platform in order to provide a further validation of the results obtained by our cost model.

Electronic Thesis and Dissertation Archive - Università di Pisa

System software for the finite element machine

Authors: Crockett T. W.; Knott J. D.

The Finite Element Machine is an experimental parallel computer developed at Langley Research Center to investigate the application of concurrent processing to structural engineering analysis. This report describes system-level software which has been developed to facilitate use of the machine by applications researchers. The overall software design is outlined, and several important parallel processing issues are discussed in detail, including processor management, communication, synchronization, and input/output. Based on experience using the system, the hardware architecture and software design are critiqued, and areas for further work are suggested.

NASA Technical Reports Server

Investigations into the suitability of parallel computing architectures for the solution of large sparse matrices using the preconditioned conjugate gradient method

Author: El-Ghajiji Otman Abubaker
Publication date: 01/01/1995

OPUS

Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1

Authors: Baker R.; Frank G.; Gray G.; Scheper C.; Yalamanchili S.

Systems for Space Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures, using candidate SDI weapons-to-target assignment algorithms as workloads, were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed, and capabilities that will be required for both individual tools and an integrated toolset were identified.

NASA Technical Reports Server

Static allocation of computation to processors in multicomputers

Author: Norman Michael G.
Publication venue: The University of Edinburgh
Publication date: 01/01/1993

Edinburgh Research Archive

Parallel solution of power system linear equations

Author: Grey David John
Publication date: 01/01/1995

At the heart of many power system computations lies the solution of a large sparse set of linear equations. These equations arise from the modelling of the network and are the cause of a computational bottleneck in power system analysis applications. Efficient sequential techniques have been developed to solve these equations, but the solution is still too slow for applications such as real-time dynamic simulation and on-line security analysis. Parallel computing techniques have been explored in the attempt to find faster solutions, but the methods developed to date have not efficiently exploited the full power of parallel processing. This thesis considers the solution of the linear network equations encountered in power system computations. Based on the insight provided by the elimination tree, it is proposed that a novel matrix structure be adopted to allow the exploitation of parallelism which exists within the cutset of a typical parallel solution. Using this matrix structure, it is possible to reduce the size of the sequential part of the problem and to increase the speed and efficiency of a typical LU-based parallel solution. A method for transforming the admittance matrix into the required form is presented, along with network partitioning and load balancing techniques. Sequential solution techniques are considered and existing parallel methods are surveyed to determine their strengths and weaknesses. Combining the benefits of existing solutions with the new matrix structure allows an improved LU-based parallel solution to be derived. A simulation of the improved LU solution is used to show the improvements in performance over a standard LU-based solution that result from the adoption of the new techniques. The results of a multiprocessor implementation of the method are presented, and the new method is shown to have a better performance than existing methods for distributed memory multiprocessors.
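
The elimination tree that this abstract builds on can be illustrated with a short sketch. It shows only the standard definition for a structurally symmetric sparse matrix (the parent of column j is the smallest row index below the diagonal in column j of the factor, fill-in included), not the novel matrix structure the thesis proposes. The function name `elimination_tree` and the edge-list input format are assumptions made for the example.

```python
def elimination_tree(n, edges):
    """Parent array of the elimination tree of an n x n structurally symmetric matrix.

    `edges` lists (i, j) index pairs of off-diagonal nonzeros; each pair may be
    given in either order. parent[j] is None when column j is a root.
    """
    # struct[j] holds the row indices below the diagonal in column j of the factor.
    struct = [set() for _ in range(n)]
    for i, j in edges:
        lo, hi = min(i, j), max(i, j)
        if lo != hi:
            struct[lo].add(hi)

    parent = [None] * n
    for j in range(n):
        if struct[j]:
            p = min(struct[j])
            parent[j] = p
            # Eliminating column j creates fill-in in its parent's column.
            struct[p] |= struct[j] - {p}
    return parent


# Three network nodes (0, 1, 2) that are coupled only through a cutset node 3.
arrow = elimination_tree(4, [(0, 3), (1, 3), (2, 3)])

# A chain 0 - 1 - 2 - 3, as in a tridiagonal matrix.
chain = elimination_tree(4, [(0, 1), (1, 2), (2, 3)])
```

Columns in disjoint subtrees do not depend on one another, so in the first example columns 0, 1, and 2 could be factored concurrently before the cutset column 3, whereas the chain in the second example offers no such parallelism.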

Durham e-Theses

OpenGrey Repository

An investigation into Multiprocessor Systems based on UNIX

Author: Welten P.J.M.
Publication date: 28/02/1989

Pure OAI Repository

Probabilistic structural mechanics research for parallel processing computers

Authors: Chen Heh-Chyun; Martin William R.; Sues Robert H.; Twisdale Lawrence A.

Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods was hampered by their computationally intense nature. Solution of PSM problems requires repeated analyses of structures that are often large and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large-scale PSM problems practical.

NASA Technical Reports Server

Task assignment in parallel processor systems

Author: Manoharan Sathiamoorthy
Publication venue: The University of Edinburgh
Publication date: 01/01/1993

A generic object-oriented simulation platform is developed in order to conduct experiments on the performance of assignment schemes. The simulation platform, called Genesis, is generic in the sense that it can model the key parameters that describe a parallel system: the architecture, the program, the assignment scheme and the message routing strategy. Genesis uses as its basis a sound architectural representation scheme developed in the thesis. The thesis reports results from a number of experiments assessing the performance of assignment schemes using Genesis. The comparison results indicate that the new assignment scheme proposed in this thesis is a promising alternative to the work-greedy assignment schemes. The proposed scheme has a lower time complexity than the work-greedy schemes and achieves an average performance better than, or comparable to, theirs. To generate an assignment, some parameters describing the program model are required. In many cases, accurate estimation of these parameters is hard. It is thought that inaccuracies in the estimation would lead to poor assignments. The thesis investigates this speculation and presents experimental evidence showing that such inaccuracies do not greatly affect the quality of the assignments.

Edinburgh Research Archive