
    Distributed-Memory Breadth-First Search on Massive Graphs

    This chapter studies the problem of traversing large graphs in breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm and the recently introduced direction-optimizing algorithm. We analyze the performance and scalability trade-offs of different local data structures such as CSR and DCSC, of enabling in-node multithreading, and of graph decompositions such as 1D and 2D decomposition.
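
    The direction-optimizing idea is simple to state: traverse top-down while the frontier is small, and switch to bottom-up once the frontier touches a large fraction of the edges. The sketch below is a minimal shared-memory illustration in Java over a CSR graph, not the chapter's distributed code; the switching threshold is an assumed constant, and the distributed 1D/2D variants additionally exchange frontier information between nodes at every level.

    ```java
    import java.util.Arrays;

    /** Minimal shared-memory sketch of direction-optimizing BFS over a CSR graph.
     *  Illustrative only; the distributed 1D/2D algorithms add inter-node
     *  frontier exchange. The 5% switching threshold is an assumed constant. */
    public class DirectionOptBfs {
        // CSR: neighbors of v are colIdx[rowPtr[v] .. rowPtr[v+1]-1].
        static int[] bfs(int[] rowPtr, int[] colIdx, int source) {
            int n = rowPtr.length - 1;
            int[] parent = new int[n];
            Arrays.fill(parent, -1);
            parent[source] = source;
            boolean[] frontier = new boolean[n];
            frontier[source] = true;
            long frontierEdges = rowPtr[source + 1] - rowPtr[source];
            long totalEdges = colIdx.length;
            boolean grew = true;
            while (grew) {
                grew = false;
                boolean[] next = new boolean[n];
                long nextEdges = 0;
                // Bottom-up pays off once the frontier touches many edges.
                if (frontierEdges > totalEdges / 20) {
                    for (int v = 0; v < n; v++) {          // every unvisited vertex
                        if (parent[v] != -1) continue;
                        for (int e = rowPtr[v]; e < rowPtr[v + 1]; e++) {
                            if (frontier[colIdx[e]]) {     // found a frontier parent
                                parent[v] = colIdx[e];
                                next[v] = true;
                                nextEdges += rowPtr[v + 1] - rowPtr[v];
                                grew = true;
                                break;                     // early exit: one parent suffices
                            }
                        }
                    }
                } else {
                    for (int u = 0; u < n; u++) {          // classic top-down push
                        if (!frontier[u]) continue;
                        for (int e = rowPtr[u]; e < rowPtr[u + 1]; e++) {
                            int v = colIdx[e];
                            if (parent[v] == -1) {
                                parent[v] = u;
                                next[v] = true;
                                nextEdges += rowPtr[v + 1] - rowPtr[v];
                                grew = true;
                            }
                        }
                    }
                }
                frontier = next;
                frontierEdges = nextEdges;
            }
            return parent;  // parent[v] == -1 for unreachable vertices
        }
    }
    ```

    The early exit in the bottom-up loop is where the savings come from: an unvisited vertex stops scanning as soon as it finds any parent in the frontier, so dense levels cost far fewer edge inspections than a top-down sweep would.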

    Vcluster: A Portable Virtual Computing Library For Cluster Computing

    Message passing has been the dominant parallel programming model in cluster computing, and libraries like the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) have proven their utility and efficiency through numerous applications in diverse areas. However, as clusters of Symmetric Multi-Processor (SMP) and heterogeneous machines become popular, conventional message-passing models must be adapted to support this new kind of cluster efficiently. In addition, the Java programming language, with features such as an object-oriented architecture, platform-independent bytecode, and native support for multithreading, is an attractive alternative for cluster computing. This research presents a new parallel programming model and a library called VCluster that implements this model on top of a Java Virtual Machine (JVM). The programming model is based on virtual migrating threads, which support clusters of heterogeneous SMP machines efficiently. VCluster is implemented in 100% Java, exploiting Java's portability to address the problems posed by heterogeneous machines. VCluster virtualizes computational and communication resources such as threads, computation states, and communication channels across multiple separate JVMs, which makes mobile threads possible. Equipped with virtual migrating threads, it becomes feasible to balance the load on computing resources dynamically. Several large-scale parallel applications have been developed with VCluster to compare its performance and usability with other libraries. The experiments show that VCluster makes it easier to develop multithreaded parallel applications than conventional libraries like MPI, while its performance is comparable to that of MPICH, a widely used MPI library, combined with popular threading libraries such as POSIX Threads and OpenMP. In the next phase of our work, we implemented thread groups and thread migration to demonstrate the feasibility of dynamic load balancing in VCluster. Our experiments show that load can be balanced dynamically in VCluster, resulting in better performance. Thread groups also make it possible to implement collective communication between threads, which has proved useful in process-based libraries.
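
    The abstract does not show VCluster's API, so the sketch below is purely illustrative and all names in it are hypothetical. It captures the core idea behind a virtual migrating thread: keep all computation state in a serializable object rather than on the Java stack, run the work in bounded slices, and between slices checkpoint the state so the runtime can ship it to another JVM and resume it there. Here the "migration" is simulated by a local byte-array round trip.

    ```java
    import java.io.*;

    /** Hypothetical sketch of a migratable virtual thread: work runs in slices,
     *  and between slices the serializable state can be moved to another JVM
     *  (simulated below by serializing and deserializing locally). */
    public class MigratingThreadSketch {
        // All computation state lives in a serializable object, not on the stack.
        static class SumState implements Serializable {
            long sum = 0;
            int next = 0;
            final int limit;
            SumState(int limit) { this.limit = limit; }
            boolean step() {                 // run one bounded slice of work
                for (int i = 0; i < 1000 && next < limit; i++) sum += next++;
                return next < limit;         // true = more work remains
            }
        }

        static byte[] checkpoint(Serializable s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(s);          // in a real system this would cross JVMs
            }
            return bos.toByteArray();
        }

        static SumState resume(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (SumState) in.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            SumState state = new SumState(1_000_000);
            while (state.step()) {
                // A load balancer could decide here to move the thread elsewhere.
                state = resume(checkpoint(state));   // simulated migration
            }
            System.out.println("sum = " + state.sum);
        }
    }
    ```

    Because the state object, not the native thread, is what moves, the same mechanism works across heterogeneous machines: any JVM that can deserialize the object can continue the computation.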

    Performance driven distributed scheduling of parallel hybrid computations

    Exascale computing is fast becoming a mainstream research area. To realize exascale performance, large parallel computations must be scheduled efficiently and with scalable performance on a large number of cores/processors. The scheduler needs to execute in a purely distributed and online fashion, should follow the affinity inherent in the computation, and must have low time and message complexity. Further, it should avoid physical deadlocks due to bounded resources, including space/memory per core. Considering these factors simultaneously makes affinity-driven distributed scheduling particularly challenging. We address this challenge for hybrid parallel computations, which contain both tasks that have pre-specified affinity to a place and tasks that can be mapped to any place in the system. Specifically, we address two scheduling problems of the type Pm|Mj, prec|Cmax. This paper presents online distributed scheduling algorithms for hybrid parallel computations under both unconstrained and bounded space per place, and analyzes their time and message complexity. To the best of our knowledge, this is the first time that distributed scheduling algorithms for hybrid parallel computations have been presented and analyzed for time and message bounds under both unconstrained and bounded space.
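
    As a rough illustration of the affinity constraint in Pm|Mj, prec|Cmax, the sketch below (not the paper's algorithm) gives each place two deques: pinned tasks must run at their home place, while "anyplace" tasks start at an arbitrary place and may be stolen by idle places. The paper's algorithms are distributed across address spaces and additionally bound the space used per place; neither aspect is modeled in this single-JVM sketch.

    ```java
    import java.util.concurrent.*;

    /** Toy affinity-driven scheduler: 'pinned' tasks run only at their place,
     *  'flexible' tasks may be stolen by idle places. Shared-memory sketch only. */
    public class AffinitySchedulerSketch {
        static final int PLACES = 4;
        static final BlockingDeque<Runnable>[] pinned = new LinkedBlockingDeque[PLACES];
        static final BlockingDeque<Runnable>[] flexible = new LinkedBlockingDeque[PLACES];

        static void submit(Runnable task, int affinity) {
            if (affinity >= 0) pinned[affinity].addLast(task);   // Mj constraint
            else flexible[ThreadLocalRandom.current().nextInt(PLACES)].addLast(task);
        }

        static Runnable take(int self) {
            Runnable t = pinned[self].pollFirst();               // affinity work first
            if (t == null) t = flexible[self].pollFirst();       // then local flexible work
            for (int v = 0; t == null && v < PLACES; v++)        // finally steal flexible work
                if (v != self) t = flexible[v].pollLast();
            return t;                                            // null = nothing left
        }

        public static void main(String[] args) throws InterruptedException {
            for (int p = 0; p < PLACES; p++) {
                pinned[p] = new LinkedBlockingDeque<>();
                flexible[p] = new LinkedBlockingDeque<>();
            }
            for (int i = 0; i < 16; i++) {
                int id = i;
                // Even tasks are pinned to a place; odd tasks may run anywhere.
                submit(() -> System.out.printf("task %d on %s%n", id,
                        Thread.currentThread().getName()), id % 2 == 0 ? id % PLACES : -1);
            }
            Thread[] places = new Thread[PLACES];
            for (int p = 0; p < PLACES; p++) {
                int self = p;
                places[p] = new Thread(() -> {
                    for (Runnable t; (t = take(self)) != null; ) t.run();
                }, "place-" + p);
                places[p].start();
            }
            for (Thread w : places) w.join();
        }
    }
    ```

    Keeping pinned work out of the stealable deque is what preserves affinity; only the flexible pool participates in load balancing, which mirrors the hybrid split the paper makes between place-constrained and anyplace tasks.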

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience in high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because existing message-passing libraries lack adequate support for them, and they are rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. The framework defines four new performance metrics that enable new approaches to the performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of performance analysis based on the new framework. The theoretical results of this analysis can be used by designers of parallel systems and application software to assess the quality of their implementations and to predict the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected by high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress, which are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that yield significant application performance improvements. Specific demonstrations with the Virtual Interface Architecture and TCP/IP over Ethernet are offered.
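
    A minimal Java illustration of the two techniques follows; it is not MPI/Pro code, and the transfer is simulated. The communication resources are bound once before the loop, which is the analog of early binding with MPI persistent requests (MPI_Send_init), and each iteration starts a transfer, computes while it is in flight, and waits only when the buffer must be reused, the analog of MPI_Start followed by MPI_Wait.

    ```java
    import java.util.concurrent.*;

    /** Illustration of early binding and communication/computation overlap.
     *  The executor and buffer are bound once, before the loop (early binding);
     *  each iteration starts the transfer, computes while it is in flight,
     *  and only then waits for completion. */
    public class OverlapSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService comm = Executors.newSingleThreadExecutor(); // bound once
            double[] buffer = new double[1 << 20];                      // bound once

            double acc = 0;
            for (int iter = 0; iter < 10; iter++) {
                Future<?> send = comm.submit(() -> simulateTransfer(buffer)); // start
                acc += compute(iter);        // useful work overlaps the transfer
                send.get();                  // wait: transfer must finish before buffer reuse
            }
            System.out.println("result = " + acc);
            comm.shutdown();
        }

        static void simulateTransfer(double[] buf) {
            // Stand-in for a network send; a real run would move buf over the wire.
            try { Thread.sleep(5); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        static double compute(int iter) {
            double s = 0;
            for (int i = 1; i <= 100_000; i++) s += 1.0 / (i + iter);
            return s;
        }
    }
    ```

    The benefit depends on the computation taking at least as long as the transfer; when it does, the communication cost disappears behind the compute phase, which is exactly the effect the dissertation's new metrics are designed to quantify.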

    Cluster Computing in the Classroom: Topics, Guidelines, and Experiences

    With the progress of research on cluster computing, more and more universities have begun to offer courses covering the subject. A wide variety of content can be taught in such courses, which makes selecting appropriate course material difficult. The selection is further complicated by the fact that some cluster computing content is also covered by other courses, such as operating systems, networking, or computer architecture, and by the varied backgrounds of the students enrolled. Drawing on our experience teaching cluster computing at several universities in the USA and Australia and conducting tutorials at international conferences worldwide, we present prospective topics in cluster computing along with a wide variety of information sources (books, software, and materials on the web) from which instructors can choose. The course material described includes system architecture, parallel programming, algorithms, and applications. Instructors are advised to choose selected units in each of the topical areas and develop their own syllabus to meet course objectives. For example, a full course can be taught on system architecture for core computer science students, or a course on parallel programming could contain a brief coverage of system architecture and then devote the majority of time to programming methods. Other combinations are also possible. We share our experiences in teaching cluster computing and the topics we have chosen depending on course objectives.