
    Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling

    Though the GPGPU concept is well known in image processing, much more work remains to be done to fully exploit GPUs as an alternative computation engine. This paper investigates computation-to-core mapping strategies to probe the efficiency and scalability of the robust facet image modeling algorithm on GPUs. Our fine-grained computation-to-core mapping scheme shows a significant performance gain over the standard pixel-wise mapping scheme. With in-depth performance comparisons across the two mapping schemes, we analyze the impact of the level of parallelism on GPU computation and suggest two principles for optimizing future image processing applications on the GPU platform.

    Efficient Process-to-Node Mapping Algorithms for Stencil Computations

    Good process-to-compute-node mappings can be decisive for well-performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern to Cartesian grids. By thoroughly exploiting the structure inherent in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar results in communication performance. Furthermore, our algorithms also achieve significantly better mapping quality compared to previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality. Comment: 18 pages, 9 figures
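
To make the mapping problem concrete, the following is a minimal sketch that counts how many stencil neighbor pairs cross node boundaries under two simple mappings. It does not reproduce any of the three algorithms from the abstract; the 8x8 grid, the 16 processes per node, the five-point stencil, and all function names are assumptions chosen for illustration.

```python
from itertools import product

# Five-point stencil offsets (assumed for illustration; the paper handles arbitrary stencils).
FIVE_POINT = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def internode_edges(dims, node_of, stencil):
    """Count directed stencil edges whose two endpoints live on different nodes."""
    rows, cols = dims
    count = 0
    for r, c in product(range(rows), range(cols)):
        for dr, dc in stencil:
            nr, nc = r + dr, c + dc
            # Non-periodic grid: neighbors outside the grid are ignored.
            if 0 <= nr < rows and 0 <= nc < cols:
                if node_of(r, c) != node_of(nr, nc):
                    count += 1
    return count


def row_major_mapping(dims, procs_per_node):
    """Fill nodes in linear rank order (rank = r * cols + c)."""
    _, cols = dims
    return lambda r, c: (r * cols + c) // procs_per_node


def blocked_mapping(dims, block):
    """Give each node a rectangular block of the process grid."""
    _, cols = dims
    br, bc = block
    blocks_per_row = cols // bc
    return lambda r, c: (r // br) * blocks_per_row + (c // bc)


dims = (8, 8)
linear = internode_edges(dims, row_major_mapping(dims, 16), FIVE_POINT)
blocked = internode_edges(dims, blocked_mapping(dims, (4, 4)), FIVE_POINT)
print(linear, blocked)  # 48 32
```

In this toy setting the square blocks cut fewer stencil edges than the two-row strips produced by rank order, which is the kind of difference a mapping algorithm tries to exploit.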

    DReAM: Dynamic Re-arrangement of Address Mapping to Improve the Performance of DRAMs

    The initial location of data in DRAMs is determined and controlled by the address mapping, and even modern memory controllers use a fixed and run-time-agnostic address mapping. On the other hand, the memory access pattern seen at the memory interface level changes dynamically at run-time. Because the access pattern is dynamic while the address mapping in the DRAM controller is fixed, DRAM performance cannot be exploited efficiently. DReAM is a novel hardware technique that can detect a workload-specific address mapping at run-time based on the application access pattern, which improves the performance of DRAMs. The experimental results show that DReAM outperforms the best evaluated address mapping on average by 9% for mapping-sensitive workloads, by 2% for mapping-insensitive workloads, and by up to 28% across all the workloads. DReAM can be seen as an insurance policy capable of detecting which scenarios are not well served by the predefined address mapping.
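
As an illustration of why a fixed address mapping can suit one access pattern and not another, here is a small sketch that decodes addresses into bank, row, and column fields under two fixed mappings and counts row activations. The field widths, the two bit orderings, and the activation counter are hypothetical; the sketch says nothing about how DReAM itself detects or rearranges a mapping.

```python
# Hypothetical field widths in bits; real devices differ.
WIDTHS = {"column": 10, "bank": 3, "row": 14}


def make_mapping(field_order, widths):
    """Build a decoder; field_order lists fields from least to most significant bits."""
    def decode(addr):
        fields = {}
        for name in field_order:
            width = widths[name]
            fields[name] = addr & ((1 << width) - 1)
            addr >>= width
        return fields
    return decode


# Two fixed mappings (assumed): they differ only in where the bank bits sit.
row_bank_col = make_mapping(["column", "bank", "row"], WIDTHS)
row_col_bank = make_mapping(["bank", "column", "row"], WIDTHS)


def row_activations(decode, addresses):
    """Toy metric: count accesses that find a different row open in their bank."""
    open_row = {}
    activations = 0
    for addr in addresses:
        fields = decode(addr)
        if open_row.get(fields["bank"]) != fields["row"]:
            activations += 1
            open_row[fields["bank"]] = fields["row"]
    return activations


strided = [i * 1024 for i in range(32)]  # stride of 2**10 addresses
sequential = list(range(64))             # unit stride

print(row_activations(row_bank_col, strided), row_activations(row_col_bank, strided))        # 32 4
print(row_activations(row_bank_col, sequential), row_activations(row_col_bank, sequential))  # 1 8
```

Neither mapping wins on both patterns under this toy metric, which is the motivation the abstract gives for choosing the mapping at run-time.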

    Maximum Likelihood Decoder for Index Coded PSK Modulation for Priority Ordered Receivers

    Index coded PSK modulation over an AWGN broadcast channel, for a given index coding problem (ICP), is studied. For a chosen index code and an arbitrary mapping (of broadcast vectors to PSK signal points), we have derived a decision rule for the maximum likelihood (ML) decoder. The message error performance of a receiver at high SNR is characterized by a parameter called PSK Index Coding Gain (PSK-ICG). The PSK-ICG of a receiver is determined by a metric called minimum inter-set distance. For a given ICP with an order of priority among the receivers, and a chosen $2^N$-PSK constellation, we propose an algorithm to find (index code, mapping) pairs, each of which gives the best performance in terms of PSK-ICG of the receivers. No other pair of index code (of length $N$ with $2^N$ broadcast vectors) and mapping can give a better PSK-ICG for the highest priority receiver. Also, given that the highest priority receiver achieves its best performance, the next highest priority receiver achieves its maximum gain possible, and so on in the specified order of priority. Comment: 9 pages, 6 figures and 2 tables

    Task mapping on a dragonfly supercomputer

    The dragonfly network topology has recently gained traction in the design of high performance computing (HPC) systems and has been implemented in large-scale supercomputers. The impact of task mapping, i.e., placement of MPI ranks onto compute cores, on the communication performance of applications on dragonfly networks has not been comprehensively investigated on real large-scale systems. This paper demonstrates that task mapping affects the communication overhead significantly in dragonflies and that the magnitude of this effect is sensitive to the application, job size, and the OpenMP settings. Among the three task mapping algorithms we study (in-order, random, and recursive coordinate bisection), selecting a suitable task mapper reduces application communication time by up to 47%.

    Unified bit-based probabilistic data association aided MIMO detection for high-order QAM constellations

    A unified Bit-based Probabilistic Data Association (B-PDA) detection approach is proposed for Multiple-Input Multiple-Output (MIMO) systems employing high-order rectangular Quadrature Amplitude Modulation (QAM). The new approach transforms the symbol detection process of QAM to a bit-based process by introducing a Unified Matrix Representation (UMR) of QAM. Both linear natural and nonlinear binary reflected Gray bit-to-symbol mappings are considered. With the aid of simulation results, we demonstrate that the linear natural mapping based B-PDA approach typically attained an improved detection performance (measured in terms of both Bit Error Ratio (BER) and Symbol Error Ratio (SER)) in comparison to the conventional symbol-based PDA aided MIMO detector, despite its dramatically reduced computational complexity. The only exception is that at low SNRs, the linear natural mapping based B-PDA is slightly inferior in terms of its BER to the conventional symbol-based PDA using binary reflected Gray mapping. Furthermore, the simulation results show that the linear natural mapping based B-PDA MIMO detector may approach the best-case performance provided by the nonlinear binary reflected Gray mapping based B-PDA MIMO detector under ideal conditions. Additionally, the implementation of the B-PDA MIMO detector is shown to be much simpler in the case of the linear natural mapping. Based on these two points, we conclude that in the context of the uncoded B-PDA MIMO detector it is preferable to use the linear natural bit-to-symbol mapping, rather than the nonlinear Gray mapping.
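
The two bit-to-symbol mappings contrasted above can be sketched for one dimension of a rectangular QAM constellation, that is, for a PAM alphabet. This is only an illustration of natural versus binary reflected Gray labeling, not the B-PDA detector or its matrix representation. Reading "linear" as "the amplitude is an affine function of the bits" is an assumption about the terminology, and the function names are made up for this sketch.

```python
def natural_bits(index, nbits):
    """Natural binary label of a level index, most significant bit first."""
    return [(index >> k) & 1 for k in reversed(range(nbits))]


def gray_bits(index, nbits):
    """Binary reflected Gray label of a level index, most significant bit first."""
    return natural_bits(index ^ (index >> 1), nbits)


def pam_amplitude_from_natural(bits):
    """Amplitude as an affine function of natural bits: 2 * index - (M - 1)."""
    nbits = len(bits)
    levels = 1 << nbits
    return sum(b << (nbits - k) for k, b in enumerate(bits)) - (levels - 1)


def hamming(a, b):
    """Number of positions in which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))


nbits = 3
levels = 1 << nbits
gray_steps = [hamming(gray_bits(i, nbits), gray_bits(i + 1, nbits)) for i in range(levels - 1)]
natural_steps = [hamming(natural_bits(i, nbits), natural_bits(i + 1, nbits)) for i in range(levels - 1)]
amplitudes = [pam_amplitude_from_natural(natural_bits(i, 2)) for i in range(4)]

print(gray_steps)     # [1, 1, 1, 1, 1, 1, 1]
print(natural_steps)  # [1, 2, 1, 3, 1, 2, 1]
print(amplitudes)     # [-3, -1, 1, 3]
```

Under Gray labeling adjacent levels always differ in one bit, while under natural labeling the amplitude is a weighted sum of the bits plus a constant, which is the structural property a bit-based formulation can exploit.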