544 research outputs found

    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Get PDF
    Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA.

    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Get PDF
    Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies not only exist in the inter-node layer, i.e., hierarchical networks, but also exist in internals of multicore compute nodes, e.g., Non Uniform Memory Accesses (NUMA), network-style interconnect, and memory and shared cache hierarchies. Message Passing Interface (MPI), the most widely adopted in the HPC communities, suffers from decreased performance and portability due to increased hardware complexity of multiple levels. We identified three critical issues specific to collective communication: The first problem arises from the gap between logical collective topologies and underlying hardware topologies; Second, current MPI communications lack efficient shared memory message delivering approaches; Last, on distributed memory machines, like multicore clusters, a single approach cannot encompass the extreme variations not only in the bandwidth and latency capabilities, but also in features such as the aptitude to operate multiple concurrent copies simultaneously. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework to integrate the knowledge of hardware distance into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy, mapping on the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted one-sided single-copy approach (KNEM) as the default shared memory delivering method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/not-root processes to evenly distribute copy workloads among available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms together to express a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlaps between inter- and intra-node communication. Experimental results have confirmed that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, not only do MPI collectives reach the potential maximum performance on a wide variation of platforms, but they also deliver a level of performance immune to modifications of the underlying process-core binding

    Virtual Cluster Management for Analysis of Geographically Distributed and Immovable Data

    Get PDF
    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2015Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large, and thus unable to move. A large dataset in Malaysia could be ecologically sensitive, for instance, and unable to be moved outside the country boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: with setup and control, with managing behavior of the virtual cluster, and with interoperability issues across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so that it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs. Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, though once the data has been processed, the aggregated or statistical result can be shared across the boundaries. This dissertation addresses management and control of a widely distributed virtual cluster having sensitive or otherwise immovable data sets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms. In recognition of sensitive data, it can establish a single network overlay over widely distributed clusters. We define a novel class of data, notably immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where needed. We draw from our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which are pinned data. The applications implemented in HMR use extended MapReduce model where computations are expressed as three functions: Map, Reduce, and GlobalReduce. Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimum. The HMR outperforms traditional MapReduce model while processing a particular class of applications. The evaluations also show that information sharing between resources and application through the VCC shortens the hierarchical data processing time, as well satisfying the constraints on the pinned data

    CRAUL: Compiler and Run-Time Integration for Adaptation under Load

    Get PDF
    corecore