
124 research outputs found

Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers

Authors: Krishna H. S.; Singh K. P.
Publication venue: 'Defence Scientific Information and Documentation Centre'
Publication date: 01/07/2004

The internal representation of numerical data and the speed with which they are manipulated to generate the desired result, through efficient utilisation of the central processing unit, memory, and communication links, are essential aspects of all high performance scientific computations. Machine parameters, in particular, reveal the accuracy and error bounds of computation, which are required for performance tuning of codes. This paper reports diagnosis of machine parameters, measurement of the computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is demonstrated by fast Fourier transform codes. Cache and register-blocking techniques result in their optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces cache inefficiency loss, which is known to be proportional to the number of processors. Of the Linux clusters ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from an application benchmark of a multi-block Euler code test run that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers with the added advantage of speed and a high degree of parallelism.
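
The "diagnosis of machine parameters" mentioned in this abstract can be illustrated with a minimal sketch that probes the floating-point unit for its machine epsilon. This is not the paper's test procedure; the function name `machine_epsilon` is a hypothetical choice, and the sketch assumes IEEE 754 double precision with round-to-nearest, as used by standard Python floats.

```python
import sys


def machine_epsilon() -> float:
    """Return the gap between 1.0 and the next larger float.

    Hypothetical helper: halve eps until adding half of it to 1.0
    no longer changes the result.
    """
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


if __name__ == "__main__":
    eps = machine_epsilon()
    # Compare the probed value with the one reported by the interpreter.
    print(eps, eps == sys.float_info.epsilon)
```

The probed value bounds the relative rounding error of a single arithmetic operation, which is the kind of accuracy information the abstract says is needed when tuning codes.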

Defence Science Journal

Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

Author: Mudalige Gihan R.

Pipelined wavefront computations are a ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen that are best suited to these codes, there has been considerable research in analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication and synchronisation patterns, and as a result there exists a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application as a whole. This differs from previous studies in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation, and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations for each application and insights into alternative wavefront application design. The models are applied to three industry-strength wavefront codes and are validated on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster.
Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing and configuration; (3) exploring hardware platform design alternatives and system procurement; and (4) considering possible code and algorithmic optimisations.
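
The LogGP parameterisation named in this abstract can be sketched as follows. This is the standard LogGP cost of a single point-to-point message, not the thesis's wavefront model, whose equations are not given here. The class name `LogGP`, the function `message_time` and all numeric values are hypothetical, and the sketch assumes equal send and receive overheads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    """LogGP machine parameters (illustrative container, times in microseconds)."""
    L: float  # network latency
    o: float  # processor overhead per message (assumed equal for send and receive)
    g: float  # minimum gap between consecutive messages
    G: float  # gap per byte for long messages
    P: int    # number of processors


def message_time(m: LogGP, k: int) -> float:
    """End-to-end time of one k-byte message: o + (k - 1) * G + L + o."""
    if k < 1:
        raise ValueError("message size must be at least one byte")
    return m.o + (k - 1) * m.G + m.L + m.o


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    params = LogGP(L=5.0, o=1.0, g=2.0, G=0.01, P=4)
    print(message_time(params, 1), message_time(params, 101))
```

A reusable analytic model of the kind the abstract describes would combine such communication terms with per-application computation parameters; those terms are deliberately left out because the abstract does not specify them.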

Warwick Research Archives Portal Repository


The simulation of fluid flow processes using vector processors

Author: Ierotheou Constantinos Savvas
Publication date: 01/05/1990

In this thesis the potential gains in vectorisation of linear and non-linear systems of equations are investigated. Previous studies carried out on the suitability of algorithms for vectorisation have been based on the solution of Poisson's equation. In accordance with this, a range of algorithms are explored and compared using a VA-1 pipeline processor attached to a MASSCOMP MC5400. Analysis shows that almost full vectorisation is possible, leading to speed-up factors of up to 90. Based on these results the vectorised conjugate gradient with a Jacobi preconditioner (JCGV) is the best of the algorithms considered. This work is extended to the development of a two-dimensional fluid flow code which is used to solve the Navier-Stokes equations; SIMPLE is implemented to handle the non-linear nature of the equations. The first two problems are isothermal flows, viz. the 'moving lid cavity' and the 'sudden expansion in a duct' problem. A study of where the greatest computational effort is expended, and subsequent vectorisation, leads to 98% of SIMPLE being modified. This results in speed-up factors of 6 for the cavity problem and 29 for the sudden expansion problem. In both problems the JCGV is marginally faster than the vectorised Jacobi with under-relaxation (JURV). However, the JCGV algorithm is not robust and it is necessary to relax the approximation carefully, otherwise high computation times or divergence are likely. Two further problems of increasing complexity are considered; these include scalar quantities of temperature and characteristics of k-ε turbulence. One problem is based on 'turbulent L-shaped flow in a duct' and the other on 'natural convection in a square cavity'. The higher scalar computation results in speed-up factors of 5 for the turbulent L-shaped flow and 11 for the natural convection problem. There is little to choose between the JCGV and JURV algorithms; however, the robustness problems with the JCGV algorithm remain.
A multigrid method (ACM) is used to improve the convergence rate of the algorithms, particularly as the size of the problem is increased. Although it is more effective in scalar, it also provides worthwhile improvements for the vectorised algorithms, with overall factors of 8.5. Convergence difficulties with the JCG algorithm also prevent its combination with the ACM method. Therefore, the vectorised JUR algorithm with the ACM method is not only more efficient and reliable, but also has scope for improvement as the grid is increased. The potential gains in vectorisation of the SIMPLE family on pipeline architectures have been clearly demonstrated and indicate that such efforts on practical CFD codes should be well rewarded with regard to processor performance.
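
The conjugate gradient method with a Jacobi preconditioner that this abstract refers to can be sketched in a few lines. This is a plain scalar textbook version applied to a one-dimensional Poisson matrix, not the thesis's vectorised JCGV code; the function names `poisson_matvec`, `dot` and `jacobi_pcg`, the tolerance and the problem size are all hypothetical choices.

```python
def poisson_matvec(x):
    """Apply the 1D Poisson matrix tridiag(-1, 2, -1) to the vector x."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(matvec, diag, b, tol=1e-10, max_iter=1000):
    """Solve A x = b by conjugate gradients, preconditioned with diag(A)."""
    x = [0.0] * len(b)
    r = list(b)                                # residual for the zero initial guess
    z = [ri / di for ri, di in zip(r, diag)]   # Jacobi step: divide by the diagonal
    p = list(z)
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, k
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    n = 20
    b = poisson_matvec([1.0] * n)   # right-hand side whose exact solution is all ones
    x, iterations = jacobi_pcg(poisson_matvec, [2.0] * n, b)
    print(iterations, max(abs(xi - 1.0) for xi in x))
```

Every update inside the loop is an element-wise vector operation or an inner product, which is consistent with the abstract's finding that almost full vectorisation is possible.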

Greenwich Academic Literature Archive

Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

Author: Mudalige Gihan Ravideva
Publication date: 01/01/2009

OpenGrey Repository

Polygon-based hidden surface elimination algorithms: serial and parallel

Author: Julian C. Highfield
Publication date: 01/01/1994

Chapter 1 introduces the need for rapid solutions of hidden surface elimination (HSE) problems in the interactive display of objects and scenes, as used in many application areas such as flight and driving simulators and CAD systems. It reviews the existing approaches to high-performance computer graphics and to parallel computing. It then introduces the central tenet of this thesis: that general purpose parallel computers may be usefully applied to the solution of HSE problems. Finally, it introduces a set of metrics for describing sets of scene data, and applies them to the test scenes used in this thesis. Chapter 2 describes variants of several common image space hidden surface elimination algorithms, which solve the HSE problem for scenes described as collections of polygons. Implementations of these HSE algorithms on a traditional, serial, single microprocessor computer are introduced and theoretical estimates of their performance are derived. The algorithms are compared under identical conditions for various sets of test data. The results of this comparison are then placed in context with existing historical results. Chapter 3 examines the application of MIMD style parallelism to accelerate the solution of HSE problems. MIMD parallel implementations of the previously considered HSE algorithms are introduced. Their behaviour under various system configurations and for various data sets is investigated and compared with theoretical estimates. The theoretical estimates are found to match the experimental findings closely. Chapter 4 summarises the conclusions of this thesis, finding that HSE algorithms can be implemented to use an MIMD parallel computer effectively, and that, of the HSE algorithms examined, the z-buffer algorithm generally proves to be a good compromise solution.

Loughborough University Institutional Repository

Parallelisation of EST clustering

Author: Ranchod Pravesh
Publication date: 23/03/2006

Master of Science - Science
The field of bioinformatics has been developing steadily, with computational problems related to biology taking on an increased importance as further advances are sought. The large data sets involved in problems within computational biology have dictated a search for good, fast approximations to computationally complex problems. This research aims to improve a method used to discover and understand genes, which are small subsequences of DNA. A difficulty arises because genes contain parts we know to be functional and other parts we assume are non-functional, as their functions have not been determined. Isolating the functional parts requires the use of natural biological processes which perform this separation. However, these processes cannot read long sequences, forcing biologists to break a long sequence into a large number of small sequences and then read these. This creates the computational difficulty of categorizing the short fragments according to gene membership. Expressed Sequence Tag Clustering is a technique used to facilitate the identification of expressed genes by grouping together similar fragments, with the assumption that they belong to the same gene. The aim of this research was to investigate the usefulness of distributed memory parallelisation for the Expressed Sequence Tag Clustering problem. This was investigated empirically, with a distributed system tested for speed against a sequential one. It was found that distributed memory parallelisation can be very effective in this domain. The results showed a super-linear speedup for up to 100 processors; higher numbers were not tested, but are likely to produce further speedups. The system was able to cluster 500,000 ESTs in 641 minutes using 101 processors.

Wits Institutional Repository on DSPACE

Achieving parallel performance in scientific computations

Author: Clarke Lyndon J.
Publication venue: The University of Edinburgh
Publication date: 01/01/1990

Edinburgh Research Archive

Engineering the performance of parallel applications

Author: MacDonald Neil Blair
Publication venue: The University of Edinburgh
Publication date: 01/01/1996

Edinburgh Research Archive

The development of a multi-layer architecture for image processing

Author: Fung Yu Fai
Publication venue: UCL (University College London)
Publication date: 01/01/1991

The extraction of useful information from an image involves a series of operations, which can be functionally divided into low-level, intermediate-level and high-level processing. Because different amounts of computing power may be demanded by each level, a system which can simultaneously carry out operations at different levels is desirable. A multi-layer system which embodies both functional and spatial parallelism is envisioned. This thesis describes the development of a three-layer architecture which is designed to tackle vision problems embodying operations in each processing level. A survey of various multi-layer and multi-processor systems is carried out and a set of guidelines for the design of a multi-layer image processing system is established. The linear array is proposed as a possible basis for multi-layer systems and a significant part of the thesis is concerned with a study of this structure. The CLIP7A system, which is a linear array with 256 processing elements, is examined in depth. The CLIP7A system operates under SIMD control, enhanced by local autonomy. In order to examine the possible benefits of this arrangement, image processing algorithms which exploit the autonomous functions are implemented. The structural properties of linear arrays are also studied. Information regarding typical computing requirements in each layer and the communication networks between elements in different layers is obtained by applying the CLIP7A system to solve an integrated vision problem. From the results obtained, a three-layer architecture is proposed. The system has 256, 16 and 4 processing elements in the low, intermediate and high level layers respectively. The processing elements will employ a 16-bit microprocessor as the computing unit, which is selected from off-the-shelf components. Communication between elements in consecutive layers is via two different networks, which are designed so that efficient data transfer is achieved.
Additionally, the networks enable the system to maintain fault tolerance and to permit expansion in the second and third layers.

UCL Discovery