Search CORE

365 research outputs found

ReSHAPE: A Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment

Author: Ribbens Calvin J.
Sudarsan Rajesh
Publication venue
Publication date: 01/01/2007
Field of study

Applications in science and engineering often require huge computational resources for solving problems within a reasonable time frame. Parallel supercomputers provide the computational infrastructure for solving such problems. A traditional application scheduler running on a parallel cluster only supports static scheduling where the number of processors allocated to an application remains fixed throughout the lifetime of execution of the job. Due to the unpredictability in job arrival times and varying resource requirements, static scheduling can result in idle system resources thereby decreasing the overall system throughput. In this paper we present a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executed on distributed memory platforms. The framework includes a scheduler that supports resizing of applications, an API to enable applications to interact with the scheduler, and a library that makes resizing viable. Applications executed using the ReSHAPE scheduler framework can expand to take advantage of additional free processors or can shrink to accommodate a high priority application, without getting suspended. In our research, we have mainly focused on structured applications that have two-dimensional data arrays distributed across a two-dimensional processor grid. The resize library includes algorithms for processor selection and processor mapping. Experimental results show that the ReSHAPE framework can improve individual job turn-around time and overall system throughput.Comment: 15 pages, 10 figures, 5 tables Submitted to International Conference on Parallel Processing (ICPP'07

arXiv.org e-Print Archive

Computer Science Technical Reports @Virginia Tech

CiteSeerX

Performance of scientific processing in networks of workstations: matrix multiplication example

Author: Tinetti Fernando Gustavo
Publication venue
Publication date: 01/01/2001
Field of study

Parallel computing on networks of workstations are intensively used in some application areas such as linear algebra operations. Topics such as processing as well as communication hardware heterogeneity are considered solved by the use of parallel processing libraries, but experimentation about performance under these circumstances seems to be necessary. Also, installed networks of workstations are specially attractive due to its extremely low cost for parallel processing as well as its great availability given the number of installed local area networks. The performance of such networks of workstations is fully analyzed by means of a simple application: matrix multiplication. A parallel algorithm is proposed for matrix multiplication derived from two main sources: a) previous proposed algorithms for this task in traditional parallel computers, and b) the bus based interconnection network of workstations. This parallel algorithm is analyzed experimentally in terms of workstations workload and data communication, two main factors in overall parallel computing performance

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Centro de Servicios en Gestión de Información

Power management and optimization

Author: Sundararajan Hari
Publication venue: LSU Digital Commons
Publication date: 01/01/2011
Field of study

After many years of focusing on “faster” computers, people have started taking notice of the fact that the race for “speed” has had the unfortunate side effect of increasing the total power consumed, thereby increasing the total cost of ownership of these machines. The heat produced has required expensive cooling facilities. As a result, it is difficult to ignore the growing trend of “Green Computing,” which is defined by San Murugesan as “the study and practice of designing, manufacturing, using, and disposing of computers, servers, and associated subsystems – such as monitors, printers, storage devices, and networking and communication systems – efficiently and effectively with minimal or no impact on the environment”. There have been different approaches to green computing, some of which include data center power management, operating system support, power supply, storage hardware, video card and display hardware, resource allocation, virtualization, terminal servers and algorithmic efficiency. In this thesis, we particularly study the relation between algorithmic efficiency and power consumption, obtaining performance models in the process. The algorithms studied primarily include basic linear algebra routines, such as matrix and vector multiplications and iterative solvers. Our studies show that it if the source code is optimized and tuned to the particular hardware used, there is a possibility of reducing the total power consumed at only slight costs to the computation time. The data sets utilized in this thesis are not significantly large and consequently, the power savings are not large either. However, as these optimizations can be scaled to larger data sets, it presents a positive outlook for power savings in much larger research environments

Louisiana State University

Performance of scientific processing in networks of workstations: matrix multiplication example

Author: Tinetti Fernando Gustavo
Publication venue
Publication date: 01/01/2001
Field of study

A Coarse-Grain Parallel Implementation of the Block Tridiagonal Divide and Conquer Algorithm for Symmetric Eigenproblems.

Author: Day Robert M.
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/05/2003
Field of study

Cuppen’s divide and conquer technique for symmetric tridiagonal eigenproblems, along with Gu and Eisenstat’s modification for improvement of the eigenvector computation, has yielded a stable, efficient, and widely-used algorithm. This algorithm has now been extended to a larger class of matrices, namely symmetric block tridiagonal eigenproblems. The Block Tridiagonal Divide and Conquer algorithm has shown several characteristics that make it suitable for a number of applications, such as the Self-Consistent-Field procedure in quantum chemistry. This thesis discusses the steps taken to implement a coarse-grain parallel version of the Block Tridiagonal Divide and Conquer algorithm, suitable for a parallel supercomputer or a cluster of machines. The parallel version relies on components of the ScaLAPACK parallel linear algebra library and follows the same model as the serial code, including the implementation of full deflation. A modest speedup (2 to 3) was achieved using a few processors (4 and 16). Increasing the number of processors from 4 to 16 produced only slightly better speedup. This implementation was not competitive with the standard ScaLAPACK symmetric eigenvalue routine. Analysis shows that the distribution scheme chosen for the eigenvector storage requires n x O(p)2 function calls to the ScaLAPACK matrix multiplication routine, where n is the matrix size and p is the number of blocks. The matrix multiplications are responsible for the majority of the computational cost; therefore, the associated overhead needs to be reduced in order to make this implementation more competitive

University of Tennessee, Knoxville: Trace

Weak scalability analysis of the distributed-memory parallel MLFMA

Author: Bogaert Ignace
De Zutter Daniël
Fostier Jan
Michiels Bart
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013
Field of study

Distributed-memory parallelization of the multilevel fast multipole algorithm (MLFMA) relies on the partitioning of the internal data structures of the MLFMA among the local memories of networked machines. For three existing data partitioning schemes (spatial, hybrid and hierarchical partitioning), the weak scalability, i.e., the asymptotic behavior for proportionally increasing problem size and number of parallel processes, is analyzed. It is demonstrated that none of these schemes are weakly scalable. A nontrivial change to the hierarchical scheme is proposed, yielding a parallel MLFMA that does exhibit weak scalability. It is shown that, even for modest problem sizes and a modest number of parallel processes, the memory requirements of the proposed scheme are already significantly lower, compared to existing schemes. Additionally, the proposed scheme is used to perform full-wave simulations of a canonical example, where the number of unknowns and CPU cores are proportionally increased up to more than 200 millions of unknowns and 1024 CPU cores. The time per matrix-vector multiplication for an increasing number of unknowns and CPU cores corresponds very well to the theoretical time complexity

Ghent University Academic Bibliography