
    Solution of partial differential equations on vector and parallel computers

    This paper reviews the status of numerical methods for partial differential equations on vector and parallel computers. The relevant aspects of these computers are discussed, and a brief review of their development is included, with particular attention paid to the characteristics that influence algorithm selection. Both direct and iterative methods are covered for elliptic equations, as well as explicit and implicit methods for initial-boundary-value problems. The intent is to point out attractive methods, as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.
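
    The review is descriptive rather than algorithmic; as a minimal illustrative sketch (my example, not drawn from the paper) of the kind of iterative elliptic solver that suits vector hardware, the following performs Jacobi sweeps for the 2-D Poisson equation, where each sweep is a single whole-array update.

```python
import numpy as np

def jacobi_poisson(f, h, iterations=1000):
    """Illustrative vector-friendly iterative elliptic solver: Jacobi
    sweeps for the 2-D Poisson equation -lap(u) = f on the unit square
    with zero Dirichlet boundaries.  Every update depends only on the
    previous iterate, so a full sweep is one array expression -- the
    property that made Jacobi attractive on vector machines despite
    its slow convergence."""
    u = np.zeros_like(f)
    for _ in range(iterations):
        u_new = u.copy()
        u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:] +
                                    h * h * f[1:-1, 1:-1])
        u = u_new
    return u

# Example: a point source in the middle of a 65x65 grid.
n = 65
h = 1.0 / (n - 1)
f = np.zeros((n, n)); f[n // 2, n // 2] = 1.0 / (h * h)
u = jacobi_poisson(f, h)
```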

    On the impact of communication complexity in the design of parallel numerical algorithms

    This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney and is suitable for shared-memory multiprocessors in which each processor has vector capabilities. The other model is applicable to highly parallel, nonshared-memory MIMD systems; in it, algorithm performance is characterized in terms of the design of the communication network. Techniques from VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
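
    As a hedged sketch of the first model's flavor (parameter values below are illustrative, not taken from the paper): Hockney characterizes a data stream by an asymptotic rate r_inf and a half-performance length n_half, giving a transfer time t(n) = (n + n_half) / r_inf for n words.

```python
def hockney_time(n, r_inf, n_half):
    """Hockney-style cost of moving n words: t(n) = (n + n_half) / r_inf,
    where r_inf is the asymptotic rate and n_half is the message/vector
    length at which half of r_inf is achieved.  Equivalently,
    t(n) = alpha + beta * n with startup alpha = n_half / r_inf and
    per-word cost beta = 1 / r_inf."""
    return (n + n_half) / r_inf

def effective_rate(n, r_inf, n_half):
    """Achieved rate r(n) = n / t(n); note r(n_half) == r_inf / 2."""
    return n / hockney_time(n, r_inf, n_half)

# Illustrative parameters: r_inf = 100 Mword/s, n_half = 200 words.
for n in (10, 200, 10_000):
    print(n, effective_rate(n, 100e6, 200) / 1e6, "Mword/s")
```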

    General method of synthesis by PLIC/FPGA digital devices to perform discrete orthogonal transformations

    A general method is proposed for synthesizing digital devices that perform discrete orthogonal transformations (DOTs) on programmable logic integrated circuits (PLICs) of the FPGA class. The basic, and slowest, operation in performing a DOT is multiplication by a constant factor (OMC). The digital devices that perform DOTs are built from IP cores of a single type, each of which realizes an OMC. In the proposed method, the OMC is defined by a mapping over the elements of a Galois field. By computing the resulting systems of nonlinear polynomial functions over the Galois field in a distributed fashion in the PLIC/FPGA architecture, the time complexity of performing an OMC is reduced. Each nonlinear polynomial function, like the OMC itself, is realized from IP cores of the same type according to one of several structural schemes, chosen to meet the requirements on the DOT device. The use of IP cores significantly reduces the cost of designing a device that implements a DOT in the PLIC/FPGA architecture.
    Keywords: digital signal processing, discrete orthogonal transformations, distributed computing, nonlinear polynomial functions, Galois fields, FPGAs, digital devices
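
    The paper's OMC construction works over a Galois field and is not reproduced here; as a simpler stand-in showing the general LUT-style constant multiplier commonly used on FPGAs (my sketch, not the paper's method; the digit width, table sizes, and the constant 217 are illustrative assumptions), the following splits the operand into 4-bit digits and sums shifted table lookups.

```python
def make_omc_tables(c, digit_bits=4, n_digits=4):
    """Precompute per-digit lookup tables for multiplying by the
    constant c.  Splitting the operand into digits and summing shifted
    table lookups is the classic LUT-based constant multiplier on
    FPGAs; it stands in for, but does not reproduce, the paper's
    Galois-field construction."""
    return [[(c * d) << (i * digit_bits) for d in range(1 << digit_bits)]
            for i in range(n_digits)]

def omc(x, tables, digit_bits=4):
    """Multiply x by the baked-in constant using table lookups only."""
    mask = (1 << digit_bits) - 1
    acc = 0
    for i, table in enumerate(tables):
        acc += table[(x >> (i * digit_bits)) & mask]
    return acc

tables = make_omc_tables(c=217)            # the DOT's constant factor
assert omc(0x3A7C, tables) == 217 * 0x3A7C  # handles 16-bit operands
```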

    Gigaflop performance on a CRAY-2: Multitasking a computational fluid dynamics application

    The methodology is described for converting a large, long-running applications code that executed on a single processor of a CRAY-2 supercomputer to a version that executes efficiently on multiple processors. Although the conversion of every application is different, the types of modification used to achieve gigaflop performance are discussed to assist others in parallelizing applications for CRAY computers, especially applications that were developed for other machines. An existing application from computational fluid dynamics, which had used over 2000 hours of CPU time on the CRAY-2 during the previous year, was chosen as a test case for studying the effectiveness of multitasking on the CRAY-2. The nature of the dominant calculations in the application indicated that a sustained computational rate of 1 billion floating-point operations per second, or 1 gigaflop, might be achievable. The code was first analyzed and modified for optimal performance on a single processor in a batch environment. Once optimal single-CPU performance had been achieved, the code was modified to use multiple processors in a dedicated environment. The results of these two efforts were merged into a single code with a sustained computational rate of over 1 gigaflop on a CRAY-2. Timings and analysis of performance are given for both single- and multiple-processor runs.
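
    The abstract reports timings rather than formulas; purely as a back-of-envelope framing (mine, not the paper's, and the serial fractions below are made up), Amdahl's law shows why the serial fraction remaining after single-CPU tuning bounds what multitasking across the CRAY-2's four processors can deliver.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / p).  The serial
    fraction s that survives single-processor tuning caps what
    multitasking across p CPUs can deliver."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Illustrative numbers only (the CRAY-2 had 4 CPUs); the actual
# fractions for the CFD code are not given in the abstract.
for s in (0.10, 0.02, 0.005):
    print(f"serial fraction {s:5.3f} -> speedup on 4 CPUs: "
          f"{amdahl_speedup(s, 4):.2f}x")
```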

    General Purpose Computation on Graphics Processing Units Using OpenCL

    Computational science has emerged as a third pillar of science alongside theory and experiment, and parallel scientific computing is served by a range of shared- and distributed-memory architectures: supercomputers, grid- and cluster-based systems, multi-core and multiprocessor systems, and so on. In recent years, the use of GPUs (graphics processing units) for general-purpose computing, commonly known as GPGPU, has made them an exciting addition to high-performance computing (HPC) with respect to the price/performance ratio. Current GPUs consist of several hundred computing cores arranged in streaming multiprocessors, so the available degree of parallelism is promising. Moreover, the development of new and easy-to-use programming languages and interfaces such as OpenCL and CUDA has made GPUs suitable for computationally demanding applications such as micromagnetic simulation. In micromagnetic simulations, studying magnetic behavior at very small time and space scales demands a huge amount of computation; the calculation of the magnetostatic field, performed with O(N log N) complexity using an FFT algorithm for the discrete convolution, is the main contribution to the overall simulation time and is repeated at every time step. Observing magnetization behavior at sub-nanosecond time scales is crucial to a number of areas, such as magnetic sensors, non-volatile storage devices, and magnetic nanowires. Since micromagnetic codes are generally well suited to parallel programming, as they can easily be divided into independent parts that run in parallel, the current trend is to shift the computationally intensive parts to GPUs.
    My PhD work focuses on the development of a highly parallel magnetostatic field solver for micromagnetic simulators on GPUs. I use OpenCL for the GPU implementation because it is an open, cross-platform standard for the parallel programming of heterogeneous systems. The magnetostatic field calculation is dominated by multidimensional FFT (fast Fourier transform) computations, so I developed a specialized OpenCL-based 3D-FFT library for the magnetostatic field calculation that fully exploits the zero-padded input data, without transpositions, together with the symmetries inherent in the field calculation; it also provides a common interface for GPUs from different vendors. To fully utilize the parallel architecture of GPUs, the code must handle many hardware-specific technicalities, such as coalesced memory access, the data-transfer overhead between GPU and CPU, GPU global-memory utilization, arithmetic computation, and batch execution. As a second step, to further increase the level of parallelism and the performance, I developed a parallel magnetostatic field solver running on multiple GPUs; using multiple GPUs sidesteps many single-GPU limitations (e.g., on-chip memory resources) by exploiting the combined resources of several on-board devices. The GPU implementation shows an impressive speedup over an equivalent OpenMP-based parallel implementation on the CPU, which means that micromagnetic simulations requiring weeks of computation on a CPU can now be performed in hours, or even minutes, on GPUs.
    In parallel, I also worked on ordered queue management on GPUs. Ordered queues are used in many applications, including real-time systems, operating systems, and discrete-event simulations, and in most cases the efficiency of the application depends on the sorting algorithm behind its priority queues. The use of graphics cards for general-purpose computing has prompted a fresh look at sorting algorithms. In this work I analyze different sorting algorithms with respect to sorting time, sorting rate, and speedup on different GPU and CPU architectures, and present a new sorting technique for GPUs.
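
    As a minimal sketch of the O(N log N) magnetostatic convolution described above (my example; the real solver is 3-D, written in OpenCL, and exploits further symmetries, and all function names here are placeholders), the following computes a zero-padded FFT convolution in 1-D and checks it against the direct O(N^2) sum.

```python
import numpy as np

def demag_field_1d(m, kernel):
    """Illustrative 1-D version of the FFT-based discrete convolution
    used for the magnetostatic field: zero-pad both operands to avoid
    circular wrap-around, multiply the spectra, transform back.  This
    shows only the O(N log N) structure that replaces the O(N^2)
    direct sum."""
    n = len(m)
    size = 2 * n                      # zero-padded length
    M = np.fft.rfft(m, size)          # rfft zero-pads to `size`
    K = np.fft.rfft(kernel, size)
    h = np.fft.irfft(M * K, size)
    return h[:n]                      # keep the linear-convolution part

# Check against the direct O(N^2) sum on random data.
rng = np.random.default_rng(0)
m, k = rng.standard_normal(256), rng.standard_normal(256)
assert np.allclose(demag_field_1d(m, k), np.convolve(m, k)[:256])
```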

    Algorithmic Contributions to the Theory of Regular Chains

    Regular chains, introduced about twenty years ago, have emerged as one of the major tools for solving polynomial systems symbolically. In this thesis, we focus on several algorithmic aspects of the theory of regular chains, ranging from theoretical questions to high-performance implementation issues. The inclusion test for saturated ideals is a fundamental problem in this theory. By studying the primitivity of regular chains, we show that a regular chain generates its saturated ideal if and only if it is primitive; as a result, a family of inclusion tests can be performed very efficiently. The algorithm for computing the regular GCDs of two polynomials modulo a regular chain is one of the key routines in the various triangular decomposition algorithms. By revisiting the relations between subresultants and GCDs, we propose a novel bottom-up algorithm for this task, which improves significantly on the previous algorithm and creates opportunities for parallel execution. The thesis also discusses accelerating fast Fourier transforms (FFTs) over finite fields, and FFT-based subresultant chain constructions, on massively parallel GPU architectures, which speeds up our algorithms by several orders of magnitude.
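
    The regular-GCD algorithm itself works modulo a regular chain and is not reproduced here; as a plain univariate illustration of the subresultant/GCD relationship it builds on (a sketch assuming SymPy's subresultants routine; the polynomials are my own), the last nonzero entry of the subresultant polynomial remainder sequence is the GCD up to a constant factor.

```python
from sympy import symbols, subresultants, gcd, expand

x = symbols('x')

# Two polynomials sharing the factor x**2 + 1.
f = expand((x**2 + 1) * (x**3 - 2))
g = expand((x**2 + 1) * (x + 5))

# The subresultant PRS ends, up to a constant factor, at the GCD.
# The thesis exploits the same relationship, but modulo a regular
# chain and computed bottom-up from FFT-based subresultant chains.
prs = subresultants(f, g, x)
print(prs[-1])                  # a constant multiple of x**2 + 1
print(gcd(f, g))                # x**2 + 1
```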

    Theory and realization of novel algorithms for random sampling in digital signal processing

    Random sampling is a technique which overcomes the alias problem in regular sampling. The randomization, however, destroys the symmetry property of the transform kernel of the discrete Fourier transform. Hence, when a randomly sampled sequence is transformed to its frequency spectrum, the fast Fourier transform cannot be applied and the computational complexity is O(N^2). The objectives of this research project are:
    (1) To devise sampling methods for random sampling such that computation is reduced while the anti-alias property of random sampling is maintained. Two methods of inserting limited regularities into the randomized sampling grids are proposed: parallel additive random sampling and hybrid additive random sampling, both of which save at least 75% of the multiplications required. The algorithms also lend themselves to implementation on a multiprocessor system, which further enhances the speed of the evaluation.
    (2) To study the auto-correlation sequence of a randomly sampled sequence as an alternative means of confirming its anti-alias property. The anti-alias property of the two proposed methods can be confirmed by convolution in the frequency domain, but the same conclusion is also reached by analysing, in the spatial domain, the auto-correlation of such sample sequences. A technique is proposed for evaluating the auto-correlation sequence of a randomly sampled sequence with a regular step size; it may also serve as an algorithm for converting a randomly sampled sequence into a regularly spaced sequence with a desired Nyquist frequency.
    (3) To provide rapid spectral estimation using a coarse kernel. The approximate method proposed by Mason in 1980, which trades accuracy for speed of computation, is introduced to make random sampling more attractive.
    (4) To suggest possible applications for random and pseudo-random sampling. To fully exploit its advantages, random sampling has been adopted in measurement instruments where computation of a spectrum is either minimal or not required; such applications in instrumentation are easily found in the literature. In this thesis, two applications in digital signal processing are introduced.
    (5) To suggest an inverse transformation for random sampling, so as to complete a two-way process and broaden its scope of application.
    Apart from the above, a case study of realizing the prime factor algorithm with regular sampling on a transputer network is given in Chapter 2, and a rough estimate of the signal-to-noise ratio of a spectrum obtained from random sampling is given in Chapter 3. Although random sampling is alias-free, problems of computational complexity and noise prevent it from being widely adopted in engineering applications. In the conclusions, criteria for adopting random sampling are put forward and directions for its development are discussed.
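
    As a minimal sketch of the starting point (mine, not the thesis's algorithms; sample counts and frequencies are illustrative): on a random grid the spectrum must be estimated with a direct O(N^2) sum, yet a tone above half the mean sampling rate still appears at its true frequency rather than at an alias.

```python
import numpy as np

def nonuniform_dft(t, x, freqs):
    """Direct spectral estimate for samples x taken at irregular times
    t: X(f) = sum_n x[n] * exp(-2j*pi*f*t[n]).  With no regular grid
    there is no FFT butterfly to exploit, so the cost is O(N^2) -- the
    complexity the thesis's parallel and hybrid additive sampling
    schemes attack."""
    return np.array([np.sum(x * np.exp(-2j * np.pi * f * t)) for f in freqs])

# A 90 Hz tone sampled at ~100 Hz on average: regular sampling would
# alias it down to 10 Hz, while random sampling leaves the peak at 90.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 1.0, 100))        # random sample times
x = np.sin(2 * np.pi * 90.0 * t)
freqs = np.arange(0.0, 128.0, 1.0)
spectrum = np.abs(nonuniform_dft(t, x, freqs))
print(freqs[np.argmax(spectrum)])              # ~90, not the alias 10
```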