
    Data broadcasting and reduction, prefix computation, and sorting on reduced hypercube (RH) parallel computers

    The binary hypercube parallel computer has been very popular due to its rich interconnection structure and small average internode distance, which allow the efficient embedding of frequently used topologies. The communication patterns of many parallel algorithms also match the hypercube topology. The hypercube has high VLSI complexity, however, because the number of connections to each node grows logarithmically with the number of dimensions of the hypercube. The reduced hypercube (RH) interconnection network, which is obtained by a uniform reduction in the number of links for each hypercube node, yields lower-complexity interconnection networks than hypercubes with the same number of nodes. It has been shown elsewhere that the RH interconnection network achieves performance comparable to that of the hypercube at lower hardware cost. The reduced VLSI complexity of the RH also permits the construction of larger systems, making the RH suitable for massively parallel processing. This thesis proposes algorithms for data broadcasting and reduction, prefix computation, and sorting on the RH parallel computer. All these operations are fundamental to many parallel algorithms. A worst-case analysis of each algorithm is given and compared with equivalent algorithms for the regular hypercube. It is shown that the proposed algorithms for the RH yield performance comparable to that for the regular hypercube.
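    For reference, the baseline one-to-all broadcast on a full binary hypercube completes in log2(N) steps by flipping one address bit per step (recursive doubling). The minimal sketch below only simulates that standard hypercube pattern in plain Python; it does not reproduce the RH-specific algorithms proposed in the thesis, and the function and variable names are illustrative.

```python
# Sketch: one-to-all broadcast on a d-dimensional binary hypercube via
# recursive doubling. Node IDs are 0 .. 2^d - 1; in step i, every node that
# already holds the message forwards it to the neighbor whose ID differs in bit i.
def hypercube_broadcast(dim, source=0, message="payload"):
    n = 1 << dim
    holders = {source: message}          # nodes that currently hold the message
    for bit in range(dim):               # d = log2(n) communication steps
        for node in list(holders):
            partner = node ^ (1 << bit)  # neighbor across dimension `bit`
            holders[partner] = holders[node]
    assert len(holders) == n             # every node is reached after d steps
    return holders

if __name__ == "__main__":
    print(sorted(hypercube_broadcast(3)))   # 8 nodes reached in 3 steps
```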

    Average-Case Complexity of Shellsort

    We prove a general lower bound on the average-case complexity of Shellsort: the average number of data movements (and comparisons) made by a $p$-pass Shellsort for any incremental sequence is $\Omega(p n^{1 + 1/p})$ for all $p \leq \log n$. Using similar arguments, we analyze the average-case complexity of several other sorting algorithms.
    Comment: 11 pages. Submitted to ICALP'9
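    For context, a p-pass Shellsort performs one gapped insertion-sort pass for each of p increments, the last of which is 1; the bound above counts the data movements summed over those passes. The sketch below is a minimal illustration only, and the gap sequence in it is an arbitrary example, not one analyzed in the paper.

```python
# Sketch of a p-pass Shellsort: one gapped insertion-sort pass per increment,
# with the final increment equal to 1 so the last pass completes the sort.
def shellsort(a, gaps):
    moves = 0
    for gap in gaps:                      # p = len(gaps) passes
        for i in range(gap, len(a)):
            item = a[i]
            j = i
            while j >= gap and a[j - gap] > item:
                a[j] = a[j - gap]         # one data movement
                j -= gap
                moves += 1
            a[j] = item
    return moves

if __name__ == "__main__":
    import random
    data = [random.randrange(1000) for _ in range(200)]
    m = shellsort(data, gaps=[40, 13, 4, 1])   # example 4-pass increment sequence
    assert data == sorted(data)
    print("data movements:", m)
```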

    An Efficient Multiway Mergesort for GPU Architectures

    Sorting is a primitive operation that is a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures as they provide massive parallelism and computing power. However, the intricacies of their compute and memory hierarchies make designing GPU-efficient algorithms challenging. In this work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway mergesort algorithm. MMS employs a new partitioning technique that exposes the parallelism needed by modern GPU architectures. To the best of our knowledge, MMS is the first sorting algorithm for the GPU that is asymptotically optimal in terms of global memory accesses and that is completely free of shared memory bank conflicts. We realize an initial implementation of MMS, evaluate its performance on three modern GPU architectures, and compare it to competitive implementations available in state-of-the-art GPU libraries. Despite these implementations being highly optimized, MMS compares favorably, achieving performance improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art algorithms are susceptible to bank conflicts. We find that for certain inputs that cause these algorithms to incur large numbers of bank conflicts, MMS can achieve up to a 37.6% speedup over its fastest competitor. Overall, even though its current implementation is not fully optimized, due to its efficient use of the memory hierarchy MMS outperforms the fastest comparison-based sorting implementations available to date.
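    As background, the core primitive of a multiway mergesort is a k-way merge of sorted runs; MMS's contribution is a partitioning scheme that splits such merges into many independent, equal-sized pieces so thread blocks can proceed without shared-memory bank conflicts. The sketch below shows only the sequential multiway-merge building block in Python under that reading of the abstract; it is not MMS, and the GPU partitioning and memory-hierarchy details are not reproduced here.

```python
import heapq

# Sketch: sequential k-way (multiway) merge of sorted runs, the building block
# of a multiway mergesort. A GPU version would additionally partition each merge
# so independent thread blocks produce equal-sized slices of the output.
def multiway_merge(runs):
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[idx]):
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out

def multiway_mergesort(a, k=4, run_len=8):
    runs = [sorted(a[i:i + run_len]) for i in range(0, len(a), run_len)]
    while len(runs) > 1:
        runs = [multiway_merge(runs[i:i + k]) for i in range(0, len(runs), k)]
    return runs[0] if runs else []

if __name__ == "__main__":
    import random
    data = [random.randrange(100) for _ in range(50)]
    assert multiway_mergesort(data) == sorted(data)
```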

    Parallel Performance of MPI Sorting Algorithms on Dual-Core Processor Windows-Based Systems

    Message Passing Interface (MPI) is widely used to implement parallel programs. Although Windows-based architectures provide the facilities of parallel execution and multi-threading, little attention has been paid to using MPI on these platforms. In this paper we use a dual-core Windows-based platform to study the effect of the number of parallel processes and the number of cores on the performance of three MPI parallel implementations of some sorting algorithms.
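    The abstract does not name the three sorting algorithms, so the snippet below is only a generic sketch of one common MPI sorting pattern (scatter, local sort, gather, merge at the root), written with the mpi4py Python bindings rather than whatever language and algorithms the paper actually used.

```python
# Sketch: generic MPI sort (scatter -> local sort -> gather -> k-way merge).
# Illustrative only; the paper's three implementations are not named in the abstract.
# Run with e.g.:  mpiexec -n 4 python mpi_sort_sketch.py
from mpi4py import MPI
import heapq
import random

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = [random.randrange(10**6) for _ in range(100_000)]
    chunks = [data[i::size] for i in range(size)]   # split into `size` pieces
else:
    chunks = None

local = comm.scatter(chunks, root=0)   # each process receives one chunk
local.sort()                           # local sequential sort
sorted_chunks = comm.gather(local, root=0)

if rank == 0:
    result = list(heapq.merge(*sorted_chunks))   # k-way merge of the sorted chunks
    assert result == sorted(data)
    print("sorted", len(result), "keys using", size, "processes")
```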

    A Parallel Monte Carlo Code for Simulating Collisional N-body Systems

    We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N ~ 10^7 particles. Our code is based on the Henon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and introducing a parallel random number generation scheme, as well as a parallel sorting algorithm, required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce, along with our choice of decomposition scheme, minimize communication costs and ensure optimal distribution of data and workload among the processing units. The implementation uses the Message Passing Interface (MPI) library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse, with the number of stars, N, spanning three orders of magnitude, from 10^5 to 10^7. We find that our results are in good agreement with self-similar core-collapse solutions, and the core collapse times generally agree with expectations from the literature. We also observe good total energy conservation, to within less than 0.04% throughout all simulations. We analyze the performance of the code and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N=10^5, 128 for N=10^6, and 256 for N=10^7. Beyond these limits the runtime saturates with the addition of more processors, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60x, 100x, and 220x, respectively.
    Comment: 53 pages, 13 figures, accepted for publication in ApJ Supplement
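    The abstract mentions a parallel sorting step used to order particles for nearest-neighbor interactions, but does not spell out the algorithm; the following is only a hedged sketch of a generic MPI sample sort, a common choice for this kind of distributed ordering, and all names and parameters in it are illustrative.

```python
# Sketch: generic MPI sample sort (local sort, sample splitters, all-to-all
# exchange, local merge). Illustrative only; not the paper's actual algorithm.
from mpi4py import MPI
from bisect import bisect_right
import random

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

local = sorted(random.random() for _ in range(10_000))   # e.g. radial distances

# Take regular samples from each process and choose global splitters on rank 0.
samples = local[:: max(1, len(local) // size)][:size]
all_samples = comm.gather(samples, root=0)
if rank == 0:
    flat = sorted(s for chunk in all_samples for s in chunk)
    splitters = flat[size - 1 :: size][: size - 1]
else:
    splitters = None
splitters = comm.bcast(splitters, root=0)

# Partition local data by splitter and exchange so process i holds the i-th range.
bounds = [0] + [bisect_right(local, s) for s in splitters] + [len(local)]
outgoing = [local[bounds[i]:bounds[i + 1]] for i in range(size)]
incoming = comm.alltoall(outgoing)

merged = sorted(x for part in incoming for x in part)    # locally sorted slice
if merged:
    print(f"rank {rank}: {len(merged)} keys in [{merged[0]:.3f}, {merged[-1]:.3f}]")
```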