
    Optimizing Communication for Massively Parallel Processing

    Current trends in high-performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, manually scaling applications is going to be a significant burden for the programmer. Scaling involves addressing issues such as load imbalance and communication overhead. In this thesis, we explore several communication optimizations that help parallel applications scale easily to a large number of processors. We also present automatic runtime techniques that relieve the programmer of the burden of optimizing communication in applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished its computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces that have co-processors, this happens to be true, and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers, so improving their performance would be of great value. We have successfully scaled NAMD to 1 TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis.
We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors, all-to-all communication can take several milliseconds to finish. With the synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework that lets the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, the number of processors, and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis has been designed to optimize communication in the context of processor virtualization and dynamically migrating objects. We present the streaming strategy to optimize fine-grained object-to-object communication. We also motivate the need for hardware collectives, as processor-based collectives can be delayed by intermediate processors that are busy with computation. We explore a next-generation interconnect that supports collectives in the switching hardware, and we show the performance gains of hardware collectives through synthetic benchmarks.
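The idea that the best all-to-all strategy depends on message size and processor count can be illustrated with a minimal sketch. It is not the framework described in the abstract: it assumes a simple latency-bandwidth (alpha-beta) cost model and compares only two hypothetical strategies, direct sends and a two-phase combining scheme on a virtual 2D mesh. The function names, the strategy labels, and the default values of `alpha` and `beta` are all assumptions made for illustration.

```python
import math


def direct_cost(p, m, alpha, beta):
    """Assumed model: each processor sends p-1 separate messages of m bytes."""
    return (p - 1) * (alpha + m * beta)


def mesh_cost(p, m, alpha, beta):
    """Assumed model: two phases on a virtual sqrt(p) x sqrt(p) mesh.

    In each phase a processor sends side-1 combined messages, each carrying
    the data destined for a whole row or column (side * m bytes).
    """
    side = math.ceil(math.sqrt(p))
    return 2 * (side - 1) * (alpha + side * m * beta)


def choose_strategy(p, m, alpha=10e-6, beta=1e-9):
    """Pick the cheaper strategy for p processors and m-byte messages.

    alpha is the per-message overhead in seconds and beta the per-byte
    transfer time in seconds; both defaults are illustrative guesses.
    """
    costs = {
        "direct": direct_cost(p, m, alpha, beta),
        "mesh": mesh_cost(p, m, alpha, beta),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    for size in (8, 1024, 1_000_000):
        print(size, choose_strategy(1024, size))
```

Under this model, message combining wins for small messages because it sends far fewer messages, while direct sends win for large messages because each byte crosses the network only once. An adaptive runtime could re-evaluate a choice like this as the observed message sizes change.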

    Proof of the bounded conformal conjecture

    Given any asymptotically flat 3-manifold $(M,g)$ with smooth, non-empty, compact boundary $\Sigma$, the conformal conjecture states that for every $\delta>0$, there exists a metric $g' = u^4 g$, with $u$ a harmonic function, such that the area of outermost minimal area enclosure $\overline{\Sigma}_{g'}$ of $\Sigma$ with respect to $g'$: $|\Sigma - \overline{\Sigma}_{g'}|_{g'} < \delta$. Recently, the conjecture was used to prove the Riemannian Penrose inequality for black holes with zero horizon area, and was proven to be true under the assumption of the existence of a finite number of minimal area enclosures of the boundary $\Sigma$, and boundedness of the harmonic function $u$. We prove the conjecture assuming only the boundedness of $u$.

    Explicit Space-Time Codes Achieving The Diversity-Multiplexing Gain Tradeoff

    A recent result of Zheng and Tse states that over a quasi-static channel, there exists a fundamental tradeoff, referred to as the diversity-multiplexing gain (D-MG) tradeoff, between the spatial multiplexing gain and the diversity gain that can be simultaneously achieved by a space-time (ST) block code. This tradeoff is precisely known in the case of i.i.d. Rayleigh fading, for T >= n_t + n_r - 1, where T is the number of time slots over which coding takes place and n_t, n_r are the numbers of transmit and receive antennas, respectively. For T < n_t + n_r - 1, only upper and lower bounds on the D-MG tradeoff are available. In this paper, we present a complete solution to the problem of explicitly constructing D-MG optimal ST codes, i.e., codes that achieve the D-MG tradeoff for any number of receive antennas. We do this by showing that for the square minimum-delay case, when T = n_t = n, cyclic-division-algebra (CDA) based ST codes having the non-vanishing determinant property are D-MG optimal. While constructions of such codes were previously known for restricted values of n, we provide here a construction that is valid for all n. For the rectangular case, T > n_t, we present two general techniques for building D-MG optimal rectangular ST codes from their square counterparts. A byproduct of our results establishes that the D-MG tradeoff for all T >= n_t is the same as that previously known to hold for T >= n_t + n_r - 1. Comment: Revised submission to IEEE Transactions on Information Theor
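For reference, the Zheng-Tse tradeoff curve mentioned above is the piecewise-linear function d*(r) that connects the points (k, (n_t - k)(n_r - k)) for k = 0, 1, ..., min(n_t, n_r). The sketch below evaluates that curve; it does not construct any code, and the function name `dmg_tradeoff` is a hypothetical one chosen for illustration.

```python
import math


def dmg_tradeoff(r, n_t, n_r):
    """Optimal diversity gain d*(r) at multiplexing gain r.

    Linearly interpolates between the corner points
    (k, (n_t - k) * (n_r - k)) for integer k in [0, min(n_t, n_r)].
    """
    m = min(n_t, n_r)
    if m < 1:
        raise ValueError("need at least one transmit and one receive antenna")
    if not 0 <= r <= m:
        raise ValueError("multiplexing gain must lie in [0, min(n_t, n_r)]")
    # Left endpoint of the segment containing r (clamped so r == m works).
    k = min(math.floor(r), m - 1)
    d_left = (n_t - k) * (n_r - k)
    d_right = (n_t - k - 1) * (n_r - k - 1)
    return d_left + (r - k) * (d_right - d_left)


if __name__ == "__main__":
    for r in (0, 0.5, 1, 1.5, 2):
        print(r, dmg_tradeoff(r, 2, 2))
```

For a 2 x 2 system, the curve runs from the maximum diversity gain n_t * n_r = 4 at r = 0 down to 0 at r = 2, passing through the corner point (1, 1).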