114 research outputs found

    Efficient, massively parallel eigenvalue computation

    In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. An effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the MasPar MP-1 and MP-2 systems, is described. Results of numerical tests and timings are also presented
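    The abstract does not spell out which routine was implemented, but Jacobi-type methods were a popular choice on SIMD arrays because every plane rotation applies the same uniform operations. A minimal serial sketch of the classical Jacobi eigenvalue method for a real symmetric matrix (illustrative only, not the MasPar implementation) might look like:

```python
import math

def jacobi_eigenvalues(a, tol=1e-12, max_rot=10000):
    """Classical Jacobi method for a real symmetric matrix (list of lists).

    Repeatedly annihilates the largest off-diagonal entry with a plane
    rotation A <- J^T A J; the diagonal converges to the eigenvalues.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy
    for _ in range(max_rot):
        # locate the largest-magnitude off-diagonal entry
        val, p, q = max((abs(a[i][j]), i, j)
                        for i in range(n) for j in range(i + 1, n))
        if val < tol:
            break
        # rotation angle that zeroes a[p][q]: tan(2t) = 2a_pq / (a_qq - a_pp)
        theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):  # A <- A J  (rotate columns p and q)
            akp, akq = a[k][p], a[k][q]
            a[k][p] = c * akp - s * akq
            a[k][q] = s * akp + c * akq
        for k in range(n):  # A <- J^T A  (rotate rows p and q)
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk - s * aqk
            a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))
```

    On a SIMD machine many such rotations are applied simultaneously to disjoint (p, q) pairs; the serial loop above only shows the numerical kernel.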

    Massively parallel boundary integral element method modeling of particles in a low Reynolds Number Newtonian fluid flow

    The analysis of many complex multiphase fluid flow systems is based on a scale-decoupling procedure. At the macroscale, continuum models are used to perform large-scale simulations. At the mesoscale, statistical homogenization theory is used to derive continuum models based on representative volume elements (RVEs). At the microscale, small-scale features, such as interfacial properties, are analyzed to be incorporated into mesoscale simulations. In this research, mesoscopic simulations of hard particles suspended in a Newtonian fluid undergoing nonlinear shear flow are performed using a boundary element method. To obtain an RVE at higher concentrations, several hundred particles are included in the simulations, putting considerable demands on the computational resources in terms of both CPU and memory. Parallel computing provides a viable platform to study these large multiphase systems. The implementation of a portable, parallel computer code based on the boundary element method using a block-block data distribution is discussed in this paper. The code employs updated direct-solver technologies that make use of dual-processor compute nodes

    Boundary-element parallel-computing algorithm for the microstructural analysis of general composites.

    A standard continuum-mechanics-based 3D boundary-element (BE) algorithm has been applied to the microstructural modeling of complex heterogeneous solids such as general composites. In the particular applications of this paper, the mechanical properties of carbon-nanotube-reinforced composites are estimated from three-dimensional representative volume elements (RVEs). The shell-like thin-walled carbon nanotubes (CNTs) are also simulated with 3D BE models, and a generic subregion-by-subregion (SBS) algorithm makes the microstructural description of the CNT/polymer systems possible. In fact, based on this algorithm, a general scalable BE parallel code is proposed. Square and hexagonal fiber-packing patterns are considered to simulate the 3D composite microstructures

    Parallel computing 2011, ParCo 2011: book of abstracts

    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgium

    A multiple-SIMD architecture for image and tracking analysis

    The computational requirements for real-time image-based applications are such as to warrant the use of a parallel architecture. Commonly used parallel architectures conform to the classifications of Single Instruction Multiple Data (SIMD) or Multiple Instruction Multiple Data (MIMD). Each class of architecture has its advantages and disadvantages. For example, SIMD architectures can be used on data-parallel problems, such as the processing of an image, whereas MIMD architectures are more flexible and better suited to general-purpose computing. Both types of processing are typically required for the analysis of the contents of an image. This thesis describes a novel massively parallel heterogeneous architecture, implemented as the Warwick Pyramid Machine. Both SIMD and MIMD processor types are combined within this architecture. Furthermore, the SIMD array is partitioned into smaller SIMD sub-arrays, forming a multiple-SIMD array. Thus, local data-parallel, global data-parallel, and control-parallel processing are supported. After describing the present options available in the design of massively parallel machines and the nature of the image analysis problem, the architecture of the Warwick Pyramid Machine is described in some detail. The performance of this architecture is then analysed, both in terms of peak available computational power and in terms of representative applications in image analysis and numerical computation. Two tracking applications are also analysed to show the performance of this architecture. In addition, they illustrate the possible partitioning of applications between the SIMD and MIMD processor arrays. Load-balancing techniques are then described which have the potential to increase the utilisation of the Warwick Pyramid Machine at run-time. These include mapping techniques for image regions across the multiple-SIMD arrays, and for the compression of sparse data. It is envisaged that these techniques may be found useful in other parallel systems
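    The thesis's load-balancing schemes are only summarized above. As one plausible illustration (not the thesis's actual algorithm), a greedy longest-processing-time heuristic can map image regions, weighted by an estimated processing cost, onto the least-loaded SIMD sub-array:

```python
def assign_regions(region_costs, n_subarrays):
    """Greedy LPT load balancing (illustrative sketch): visit regions in
    descending order of estimated cost and assign each one to the SIMD
    sub-array with the smallest accumulated load so far."""
    loads = [0.0] * n_subarrays
    assignment = {}
    for region, cost in sorted(region_costs.items(), key=lambda kv: -kv[1]):
        target = min(range(n_subarrays), key=loads.__getitem__)
        assignment[region] = target
        loads[target] += cost
    return assignment, loads
```

    The cost estimates per region and the two-level SIMD/MIMD split are assumptions here; the point is only that a cheap run-time heuristic can even out work across sub-arrays.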

    On Dynamic Graph Partitioning and Graph Clustering using Diffusion


    A study of memory-aware scheduling in message driven parallel programs

    This paper presents a simple but powerful memory-aware scheduling mechanism that adaptively schedules tasks in a message-driven distributed-memory parallel program. The scheduler adapts its behavior whenever memory usage exceeds a threshold by scheduling tasks known to reduce memory usage. The usefulness of the scheduler and its low overhead are demonstrated in the context of an LU matrix factorization program. In the LU program, only a single additional line of code is required to make use of the new general-purpose memory-aware scheduling mechanism. Without memory-aware scheduling, the LU program can only run with small problem sizes, but with the new memory-aware scheduling, the program scales to larger problem sizes
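    The core idea, stripped of the message-driven runtime, fits in a few lines. The sketch below is illustrative only (class and method names are invented, and real schedulers estimate memory deltas rather than being told them): tasks run in arrival order until usage crosses the threshold, after which the scheduler prefers the task expected to free the most memory.

```python
class MemoryAwareScheduler:
    """Toy memory-aware task scheduler (names and API are assumptions).

    Each submitted task carries an estimated memory delta (bytes gained
    or released when it runs). Below the threshold the queue is FIFO;
    above it, the task expected to release the most memory runs next.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.usage = 0
        self.queue = []  # list of (task, estimated memory delta)

    def submit(self, task, mem_delta):
        self.queue.append((task, mem_delta))

    def next_task(self):
        if not self.queue:
            return None
        if self.usage > self.threshold:
            # over threshold: pick the most memory-reducing task
            i = min(range(len(self.queue)), key=lambda k: self.queue[k][1])
        else:
            i = 0  # normal FIFO order
        task, delta = self.queue.pop(i)
        self.usage += delta
        return task
```

    In an LU factorization, for example, trailing-update tasks allocate buffers (positive delta) while tasks that complete and release a panel have a negative delta, so the scheduler naturally interleaves them once memory tightens.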

    Optimal load balancing techniques for block-cyclic decompositions for matrix factorization

    In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable for parallel block-partitioned dense linear algebra algorithms, such as matrix factorization. Here, the panels formed in such computations are divided across their length and evenly (re-)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has the advantage of naturally yielding good memory access patterns. Compared with traditional methods, which minimize communication costs at the expense of load balance, it has a small (in some situations negative) increase in communication volume costs. It does, however, incur extra communication startup costs, but only by a factor not exceeding 2. To maximize load balance and minimize the cost of panel redistribution, storage block sizes should be kept small; furthermore, in many situations of interest, there will be no significant communication startup penalty for doing so. Results are given on the Fujitsu AP+ parallel computer, comparing the performance of panel scattering with previously established methods for LU, LL^T, and QR factorization. These are consistent with a detailed performance model for LU factorization that is developed here for each method
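    The contrast between the two distributions can be sketched in a few lines. Under a plain block-cyclic mapping, a panel (block column) belongs to one process column, so the rest idle during its factorization; panel scattering instead splits the panel along its length across all processes. The function names and the even row-split below are assumptions for illustration, not the paper's implementation:

```python
def cyclic_owner(block_index, n_procs):
    """Standard block-cyclic mapping: block i lives on process i mod P,
    so only one process (column) owns any given panel."""
    return block_index % n_procs

def scatter_panel(panel_rows, n_procs):
    """Panel scattering (illustrative): split the panel's rows as evenly
    as possible across *all* P processes, so every process takes part in
    every panel. Returns (first row, row count) for each process."""
    base, extra = divmod(panel_rows, n_procs)
    counts = [base + (1 if p < extra else 0) for p in range(n_procs)]
    offsets = [sum(counts[:p]) for p in range(n_procs)]
    return list(zip(offsets, counts))
```

    The redistribution itself costs extra message startups (the factor-of-2 bound mentioned in the abstract), which is why small storage block sizes keep the re-scatter cheap.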

    Elemental: A new framework for distributed memory dense matrix computations

    Parallelizing dense matrix computations on distributed-memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid 1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed-memory architectures within a single processor, these packages must be revisited, since it will likely not be practical to use MPI-based implementations. Thus, this is a good time to review what lessons we have learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show the new solution achieves considerably better performance than the previously developed libraries
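    Elemental's distinguishing design choice is an element-wise cyclic 2D distribution, i.e. block-cyclic with block size 1, which simplifies redistribution logic at the cost of fine-grained messaging. A toy owner function makes the contrast with a ScaLAPACK-style block-cyclic mapping concrete (process-grid parameters here are illustrative):

```python
def elemental_owner(i, j, pr, pc):
    """Element-wise cyclic distribution over a pr x pc process grid:
    entry (i, j) lives on process row i mod pr, process column j mod pc."""
    return (i % pr, j % pc)

def scalapack_owner(i, j, mb, nb, pr, pc):
    """For contrast: block-cyclic owner with mb x nb distribution blocks,
    as used by ScaLAPACK. With mb = nb = 1 it reduces to the elemental
    distribution."""
    return ((i // mb) % pr, (j // nb) % pc)
```

    Because the elemental mapping has no block-size parameter, redistributions between matrix distributions become simple all-to-all patterns, which is part of the simplification the abstract alludes to.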

    A Low Communication Condensation-based Linear System Solver Utilizing Cramer's Rule

    Systems of linear equations are central to many science and engineering application domains. Given the abundance of low-cost parallel processing fabrics, the study of fast and accurate parallel algorithms for solving such systems is receiving attention. Fast linear solvers generally use a form of LU factorization. These methods face challenges with workload distribution and communication overhead that hinder their application in a true broadcast communication environment. Presented is an efficient framework for solving large-scale linear systems by means of a novel utilization of Cramer's rule. While the latter is often perceived to be impractical when considered for large systems, it is shown that the proposed algorithm has order N^3 complexity with pragmatic forward and backward stability. To the best of our knowledge, this is the first time that Cramer's rule has been demonstrated to be an order N^3 process. Empirical results are provided to substantiate the stated accuracy and computational complexity, clearly demonstrating the efficacy of the approach taken. The unique utilization of Cramer's rule and matrix condensation techniques yields an elegant process that can be applied to parallel computing architectures that support a broadcast communication infrastructure. The regularity of the communication patterns, and the send-ahead ability, yields a viable framework for solving linear equations using conventional computing platforms. In addition, this dissertation demonstrates the algorithm's potential for solving large-scale sparse linear systems
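    The two ingredients named in the abstract can be shown with a classical condensation method (Chio's), which shrinks an N x N determinant one dimension at a time using 2x2 minors against the pivot, giving an O(N^3) determinant. The sketch below is illustrative only: it is serial, uses exact rationals to sidestep stability questions, and naively recomputes one determinant per unknown (hence O(N^4) overall); the dissertation's contribution is reorganizing this work so the whole solve stays O(N^3).

```python
from fractions import Fraction

def det_chio(a):
    """Determinant via Chio's condensation: each pass replaces the n x n
    matrix A by the (n-1) x (n-1) matrix of 2x2 minors against the pivot
    a[0][0], using det(A) = det(B) / a[0][0]**(n-2)."""
    a = [[Fraction(x) for x in row] for row in a]
    sign = 1
    denom = Fraction(1)
    while len(a) > 1:
        n = len(a)
        if a[0][0] == 0:  # pivot: swap in a row with a nonzero leading entry
            for r in range(1, n):
                if a[r][0] != 0:
                    a[0], a[r] = a[r], a[0]
                    sign = -sign
                    break
            else:
                return Fraction(0)  # first column entirely zero
        p = a[0][0]
        denom *= p ** (n - 2)
        a = [[p * a[i][j] - a[i][0] * a[0][j] for j in range(1, n)]
             for i in range(1, n)]
    return sign * a[0][0] / denom

def cramer_solve(A, b):
    """Cramer's rule on top of the condensation determinant:
    x_k = det(A with column k replaced by b) / det(A)."""
    d = det_chio(A)
    return [det_chio([row[:k] + [b[i]] + row[k + 1:]
                      for i, row in enumerate(A)]) / d
            for k in range(len(A))]
```

    Condensation's appeal for broadcast architectures is that each pass reads one pivot row and column, which can be broadcast once while all processes update their local minors.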