Load Balancing for Parallel Loops in Workstation Clusters
Load imbalance is a serious impediment to achieving good performance in
parallel processing. Global load balancing schemes do not adequately
balance parallel tasks generated from a single application.
Dynamic loop scheduling methods are known to be useful in balancing parallel
loops on shared-memory multiprocessor machines. However, their centralized
nature causes a bottleneck even for the relatively small number of processors
in workstation clusters, because communication overheads there are an order of
magnitude higher than on shared-memory machines. Moreover, improvements of basic loop scheduling
communications overheads. Moreover, improvements of basic loop scheduling
methods have not dealt effectively with irregularly distributed workloads in
parallel loops, which commonly occur in applications for workstation clusters.
In this paper, we present a new decentralized balancing method for parallel
loops on workstation clusters.
(Also cross-referenced as UMIACS-TR-96-6)
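As context for the centralized methods the abstract contrasts with, the following is a minimal C sketch of guided self-scheduling (GSS), a classic dynamic loop-scheduling scheme: each idle worker takes ceil(R/P) of the R remaining iterations, so every chunk costs one round trip to a central queue. That per-chunk message is exactly the overhead that becomes a bottleneck on a workstation cluster. All identifiers are ours, for illustration, and do not come from the paper.

    #include <stdio.h>

    /* Guided self-scheduling (GSS): each request hands out
     * ceil(remaining / num_workers) iterations.  One request
     * per chunk is the centralized bottleneck on a cluster. */
    static long gss_next_chunk(long *remaining, int num_workers) {
        if (*remaining <= 0) return 0;
        long chunk = (*remaining + num_workers - 1) / num_workers;
        *remaining -= chunk;
        return chunk;
    }

    int main(void) {
        long remaining = 1000;   /* loop iterations left  */
        int  workers   = 8;      /* processors in cluster */
        int  requests  = 0;

        long chunk;
        while ((chunk = gss_next_chunk(&remaining, workers)) > 0)
            requests++;          /* each one is a message round trip */

        printf("%d scheduling messages for 1000 iterations\n", requests);
        return 0;
    }

On a shared-memory machine each queue access is a cheap atomic operation; on a cluster each request is a network message, which is why decentralized schemes such as the one proposed here are attractive.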
Optimisation of a parallel ocean general circulation model
This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
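As a hedged illustration of the modular message-passing layer the abstract describes, the sketch below wraps a one-dimensional halo exchange behind a single routine so the ocean code never touches MPI directly. The routine and parameter names are our invention, not the model's actual interface.

    #include <mpi.h>

    /* Illustrative wrapper: exchange one halo column of nz doubles
     * with the left and right neighbour ranks in a 1-D decomposition.
     * MPI_PROC_NULL at the domain edges makes the calls no-ops there. */
    void par_exchange_halo(double *send_l, double *recv_l,
                           double *send_r, double *recv_r,
                           int nz, int left, int right, MPI_Comm comm)
    {
        MPI_Sendrecv(send_l, nz, MPI_DOUBLE, left,  0,
                     recv_r, nz, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_r, nz, MPI_DOUBLE, right, 1,
                     recv_l, nz, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        double sl[4] = {0}, sr[4] = {0}, rl[4], rr[4];
        par_exchange_halo(sl, rl, sr, rr, 4, left, right, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Keeping the MPI calls behind one such routine is what lets the parallelism be a "modular option": on a serial build the wrapper can simply be compiled to a no-op.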
Designing a scalable dynamic load-balancing algorithm for pipelined single program multiple data applications on a non-dedicated heterogeneous network of workstations
Dynamic load balancing strategies have been shown to be the most critical part of an efficient implementation of various applications on large distributed computing systems. The need for dynamic load balancing strategies increases when the underlying hardware is a non-dedicated heterogeneous network of workstations (HNOW). This research focuses on the single program multiple data (SPMD) programming model, as it has been extensively used in parallel programming for its simplicity and scalability in terms of computational power and memory size. This dissertation formally defines and addresses the problem of designing a scalable dynamic load-balancing algorithm for pipelined SPMD applications on a non-dedicated HNOW. During this process, the HNOW parameters, SPMD application characteristics, and load-balancing performance parameters are identified. The dissertation presents a taxonomy that categorizes general load-balancing algorithms and a methodology that facilitates creating new algorithms that can harness the HNOW computing power and still preserve the scalability of the SPMD application. The dissertation devises a new algorithm, DLAH (Dynamic Load-balancing Algorithm for HNOW). DLAH is based on a modified diffusion technique which incorporates the HNOW parameters. An analytical performance bound for the worst-case scenario of the diffusion technique has been derived. The dissertation also develops and utilizes an HNOW simulation model to conduct extensive simulations, which were used to validate DLAH and compare its performance to related dynamic algorithms. The simulation results show that the DLAH algorithm is scalable and performs well for both homogeneous and heterogeneous networks. A detailed sensitivity analysis was conducted to study the effects of key parameters on performance.
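Since DLAH is described as a modified diffusion technique, a minimal sketch of the plain first-order diffusion scheme it builds on may help: each node repeatedly moves a fraction alpha of its load difference toward each neighbour, so load spreads out like heat until it is even. The heterogeneity weighting DLAH adds is not shown, and all identifiers are ours.

    #include <stdio.h>

    #define N 4   /* workstations on a ring, for illustration */

    /* One first-order diffusion sweep: node i shifts a fraction
     * alpha of each load difference toward its ring neighbours. */
    static void diffuse(double load[N], double alpha) {
        double next[N];
        for (int i = 0; i < N; i++) {
            int l = (i + N - 1) % N, r = (i + 1) % N;
            next[i] = load[i] + alpha * (load[l] - load[i])
                              + alpha * (load[r] - load[i]);
        }
        for (int i = 0; i < N; i++) load[i] = next[i];
    }

    int main(void) {
        double load[N] = {100, 0, 0, 0};   /* all work starts on node 0 */
        for (int step = 0; step < 20; step++)
            diffuse(load, 0.25);
        for (int i = 0; i < N; i++)
            printf("node %d: %.2f\n", i, load[i]);   /* -> roughly 25 each */
        return 0;
    }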
Hierarchical Parallelisation of Functional Renormalisation Group Calculations -- hp-fRG
The functional renormalisation group (fRG) has evolved into a versatile tool
in condensed matter theory for studying important aspects of correlated
electron systems. Practical applications of the method often involve a high
numerical effort, motivating the question to what extent High Performance
Computing (HPC) can leverage the approach. In this work we report on a multi-level
parallelisation of the underlying computational machinery and show that this
can speed up the code by several orders of magnitude. This in turn can extend
the applicability of the method to otherwise inaccessible cases. We exploit
three levels of parallelisation: distributed computing by means of the Message
Passing Interface (MPI), shared-memory computing using OpenMP, and vectorisation
by means of SIMD (single-instruction-multiple-data) units. Results are provided
for two distinct HPC platforms, namely the IBM-based BlueGene/Q system JUQUEEN
and an Intel Sandy-Bridge-based development cluster.
We discuss how certain issues and obstacles were overcome in the course of
adapting the code. Most importantly, we conclude that this vast improvement can
actually be accomplished by introducing only moderate changes to the code, such
that this strategy may serve as a guideline for other researchers to likewise
improve the efficiency of their codes.
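The three parallelisation levels named above compose naturally in code. Below is a minimal hedged sketch, using a toy reduction rather than the fRG kernel: MPI splits the outer index range across nodes, OpenMP threads share each node's slice, and a simd clause asks the compiler to vectorise the inner arithmetic.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NTOTAL 1048576

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Level 1: MPI -- each rank takes a contiguous slice. */
        long lo = (long)rank * NTOTAL / size;
        long hi = (long)(rank + 1) * NTOTAL / size;

        double local = 0.0;
        /* Levels 2+3: OpenMP threads plus SIMD vectorisation. */
        #pragma omp parallel for simd reduction(+:local)
        for (long i = lo; i < hi; i++) {
            double x = (double)i / NTOTAL;
            local += x * x;         /* stand-in for the fRG integrand */
        }

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }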
The Parallel Implementation of a Full Configuration Interaction Program
Both replicated-data and distributed-data parallel full configuration interaction (FCI) implementations are described. The implementation of the FCI algorithm is organized in a hybrid strings-integral driven approach. Redundant communication is avoided, and the network performance is further optimized by an improved distributed data interface library. Examples show linear scalability of the distributed-data code on both PC and workstation clusters. The new parallel implementation greatly extends the hardware on which parallel FCI calculations can be performed. The timing data on the workstation cluster show great potential for using the new parallel FCI algorithm to expand the applications of complete active space self-consistent field methods.
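As a hedged sketch of the distributed-data pattern mentioned above, the code below uses standard MPI one-sided operations rather than the paper's actual distributed data interface library: each rank exposes its slice of a large array in an MPI window, and any rank can fetch remote blocks without interrupting the owner. Sizes and names are ours.

    #include <mpi.h>
    #include <stdio.h>

    #define SLICE 1024   /* doubles owned by each rank (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank exposes its slice of the distributed array. */
        double *slice;
        MPI_Win win;
        MPI_Win_allocate(SLICE * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &slice, &win);
        for (int i = 0; i < SLICE; i++) slice[i] = rank;

        /* Fetch 8 doubles from the next rank's slice, one-sided. */
        double buf[8];
        int target = (rank + 1) % size;
        MPI_Win_fence(0, win);
        MPI_Get(buf, 8, MPI_DOUBLE, target, 0, 8, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        if (rank == 0) printf("fetched %.0f from rank %d\n", buf[0], target);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }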
Parallel Implementation of the PHOENIX Generalized Stellar Atmosphere Program
We describe the parallel implementation of our generalized stellar atmosphere
and NLTE radiative transfer computer program PHOENIX. We discuss the parallel
algorithms we have developed for radiative transfer, spectral line opacity, and
NLTE opacity and rate calculations. Our implementation uses a MIMD design based
on a relatively small number of MPI library calls. We report the results of
test calculations on a number of different parallel computers and discuss the
results of scalability tests.
Comment: To appear in ApJ, 1997, vol. 483. LaTeX, 34 pages, 3 figures; uses AASTeX macros and the styles natbib.sty and psfig.sty.
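A hedged sketch of the coarse-grained MIMD pattern described above, reduced to a handful of MPI calls: each rank handles every size-th wavelength point and one collective assembles the result. The loop body and names are placeholders, not PHOENIX code.

    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>

    #define NWL 4096   /* wavelength points (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double local[NWL] = {0}, flux[NWL];

        /* Cyclic division: rank r handles points r, r+size, r+2*size, ... */
        for (int w = rank; w < NWL; w += size)
            local[w] = exp(-(double)w / NWL);   /* placeholder opacity work */

        /* One collective assembles the full spectrum on every rank. */
        MPI_Allreduce(local, flux, NWL, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("flux[0] = %f\n", flux[0]);
        MPI_Finalize();
        return 0;
    }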
Monitorable network and CPU load statistics and their application to scheduling
Recent trends in high-speed computing have moved towards the use of networks of workstations as a cost-effective approach to parallel computing. One recently proposed solution involves the use of an existing network of workstation-class computers as a single multiprocessor, and much research is ongoing in this area. This dissertation describes work in the area of process scheduling on networks of workstations, specifically in the area of load analysis. After presenting extensive background in the field, measures of CPU and network load are defined, and a test parallel application program is presented, written for a network-multiprocessing software package called PVM. A series of experiments is then detailed, whose goal was to discover the relationship between the run time of the test application and the loads on the participating workstations and networks. The experiments include measurement of CPU loading and network loading during test application runs, during artificially elevated loads, and during quiet conditions. Results of the experiments are presented, and the applications of the results to the problem of task scheduling are examined. It is then claimed that several easily measured load metrics are useful for task scheduling, by allowing run time to be predicted within a margin of error and allowing limiting network segments to be detected and avoided.
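As a hedged illustration of an easily monitorable CPU-load measure of the kind studied here: on Linux-like systems, /proc/loadavg exposes the 1-, 5-, and 15-minute run-queue averages, and a scheduler can favour the host whose 1-minute figure is lowest. This is our own example, not the dissertation's instrumentation.

    #include <stdio.h>

    /* Read the 1-minute load average from /proc/loadavg (Linux).
     * Returns -1.0 if the file is unavailable. */
    static double cpu_load_1min(void) {
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f) return -1.0;
        double l1 = -1.0;
        if (fscanf(f, "%lf", &l1) != 1) l1 = -1.0;
        fclose(f);
        return l1;
    }

    int main(void) {
        double l1 = cpu_load_1min();
        if (l1 < 0) { fprintf(stderr, "no load data\n"); return 1; }
        /* A scheduler would compare this figure across candidate hosts
         * and place the next PVM task on the least-loaded one. */
        printf("1-minute load average: %.2f\n", l1);
        return 0;
    }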