365 research outputs found

    ReSHAPE: A Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment

    Get PDF
    Applications in science and engineering often require huge computational resources for solving problems within a reasonable time frame. Parallel supercomputers provide the computational infrastructure for solving such problems. A traditional application scheduler running on a parallel cluster only supports static scheduling where the number of processors allocated to an application remains fixed throughout the lifetime of execution of the job. Due to the unpredictability in job arrival times and varying resource requirements, static scheduling can result in idle system resources thereby decreasing the overall system throughput. In this paper we present a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executed on distributed memory platforms. The framework includes a scheduler that supports resizing of applications, an API to enable applications to interact with the scheduler, and a library that makes resizing viable. Applications executed using the ReSHAPE scheduler framework can expand to take advantage of additional free processors or can shrink to accommodate a high priority application, without getting suspended. In our research, we have mainly focused on structured applications that have two-dimensional data arrays distributed across a two-dimensional processor grid. The resize library includes algorithms for processor selection and processor mapping. Experimental results show that the ReSHAPE framework can improve individual job turn-around time and overall system throughput.Comment: 15 pages, 10 figures, 5 tables Submitted to International Conference on Parallel Processing (ICPP'07

    Performance of scientific processing in networks of workstations: matrix multiplication example

    Get PDF
    Parallel computing on networks of workstations are intensively used in some application areas such as linear algebra operations. Topics such as processing as well as communication hardware heterogeneity are considered solved by the use of parallel processing libraries, but experimentation about performance under these circumstances seems to be necessary. Also, installed networks of workstations are specially attractive due to its extremely low cost for parallel processing as well as its great availability given the number of installed local area networks. The performance of such networks of workstations is fully analyzed by means of a simple application: matrix multiplication. A parallel algorithm is proposed for matrix multiplication derived from two main sources: a) previous proposed algorithms for this task in traditional parallel computers, and b) the bus based interconnection network of workstations. This parallel algorithm is analyzed experimentally in terms of workstations workload and data communication, two main factors in overall parallel computing performance

    Power management and optimization

    Get PDF
    After many years of focusing on “faster” computers, people have started taking notice of the fact that the race for “speed” has had the unfortunate side effect of increasing the total power consumed, thereby increasing the total cost of ownership of these machines. The heat produced has required expensive cooling facilities. As a result, it is difficult to ignore the growing trend of “Green Computing,” which is defined by San Murugesan as “the study and practice of designing, manufacturing, using, and disposing of computers, servers, and associated subsystems – such as monitors, printers, storage devices, and networking and communication systems – efficiently and effectively with minimal or no impact on the environment”. There have been different approaches to green computing, some of which include data center power management, operating system support, power supply, storage hardware, video card and display hardware, resource allocation, virtualization, terminal servers and algorithmic efficiency. In this thesis, we particularly study the relation between algorithmic efficiency and power consumption, obtaining performance models in the process. The algorithms studied primarily include basic linear algebra routines, such as matrix and vector multiplications and iterative solvers. Our studies show that it if the source code is optimized and tuned to the particular hardware used, there is a possibility of reducing the total power consumed at only slight costs to the computation time. The data sets utilized in this thesis are not significantly large and consequently, the power savings are not large either. However, as these optimizations can be scaled to larger data sets, it presents a positive outlook for power savings in much larger research environments

    Performance of scientific processing in networks of workstations: matrix multiplication example

    Get PDF
    Parallel computing on networks of workstations are intensively used in some application areas such as linear algebra operations. Topics such as processing as well as communication hardware heterogeneity are considered solved by the use of parallel processing libraries, but experimentation about performance under these circumstances seems to be necessary. Also, installed networks of workstations are specially attractive due to its extremely low cost for parallel processing as well as its great availability given the number of installed local area networks. The performance of such networks of workstations is fully analyzed by means of a simple application: matrix multiplication. A parallel algorithm is proposed for matrix multiplication derived from two main sources: a) previous proposed algorithms for this task in traditional parallel computers, and b) the bus based interconnection network of workstations. This parallel algorithm is analyzed experimentally in terms of workstations workload and data communication, two main factors in overall parallel computing performance.Facultad de Informátic

    A Coarse-Grain Parallel Implementation of the Block Tridiagonal Divide and Conquer Algorithm for Symmetric Eigenproblems.

    Get PDF
    Cuppen’s divide and conquer technique for symmetric tridiagonal eigenproblems, along with Gu and Eisenstat’s modification for improvement of the eigenvector computation, has yielded a stable, efficient, and widely-used algorithm. This algorithm has now been extended to a larger class of matrices, namely symmetric block tridiagonal eigenproblems. The Block Tridiagonal Divide and Conquer algorithm has shown several characteristics that make it suitable for a number of applications, such as the Self-Consistent-Field procedure in quantum chemistry. This thesis discusses the steps taken to implement a coarse-grain parallel version of the Block Tridiagonal Divide and Conquer algorithm, suitable for a parallel supercomputer or a cluster of machines. The parallel version relies on components of the ScaLAPACK parallel linear algebra library and follows the same model as the serial code, including the implementation of full deflation. A modest speedup (2 to 3) was achieved using a few processors (4 and 16). Increasing the number of processors from 4 to 16 produced only slightly better speedup. This implementation was not competitive with the standard ScaLAPACK symmetric eigenvalue routine. Analysis shows that the distribution scheme chosen for the eigenvector storage requires n x O(p)2 function calls to the ScaLAPACK matrix multiplication routine, where n is the matrix size and p is the number of blocks. The matrix multiplications are responsible for the majority of the computational cost; therefore, the associated overhead needs to be reduced in order to make this implementation more competitive

    Weak scalability analysis of the distributed-memory parallel MLFMA

    Get PDF
    Distributed-memory parallelization of the multilevel fast multipole algorithm (MLFMA) relies on the partitioning of the internal data structures of the MLFMA among the local memories of networked machines. For three existing data partitioning schemes (spatial, hybrid and hierarchical partitioning), the weak scalability, i.e., the asymptotic behavior for proportionally increasing problem size and number of parallel processes, is analyzed. It is demonstrated that none of these schemes are weakly scalable. A nontrivial change to the hierarchical scheme is proposed, yielding a parallel MLFMA that does exhibit weak scalability. It is shown that, even for modest problem sizes and a modest number of parallel processes, the memory requirements of the proposed scheme are already significantly lower, compared to existing schemes. Additionally, the proposed scheme is used to perform full-wave simulations of a canonical example, where the number of unknowns and CPU cores are proportionally increased up to more than 200 millions of unknowns and 1024 CPU cores. The time per matrix-vector multiplication for an increasing number of unknowns and CPU cores corresponds very well to the theoretical time complexity
    • …
    corecore