Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling

Author: Cao Yong
Park Seung In
Quek Francis
Watson Layne T.
Publication date: 01/01/2012

Though the GPGPU concept is well known in image processing, much work remains to be done to fully exploit GPUs as an alternative computation engine. This paper investigates computation-to-core mapping strategies to probe the efficiency and scalability of the robust facet image modeling algorithm on GPUs. Our fine-grained computation-to-core mapping scheme shows a significant performance gain over the standard pixel-wise mapping scheme. With in-depth performance comparisons across the two mapping schemes, we analyze the impact of the level of parallelism on GPU computation and suggest two principles for optimizing future image processing applications on the GPU platform.
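The difference between a pixel-wise and a fine-grained computation-to-core mapping can be sketched on the CPU. This is a minimal sketch under stated assumptions, not the paper's kernels: the per-pixel work is assumed to be a k x k mask correlation, which is how a non-robust facet fit with a fixed polynomial basis can be evaluated, and the robust (iteratively reweighted) fit is not modeled. `pixelwise` and `fine_grained` are hypothetical names; the first assigns one logical thread per pixel, the second one logical thread per (pixel, window offset) pair.

```python
from typing import List

Grid = List[List[float]]


def pixelwise(img: Grid, mask: Grid) -> Grid:
    """One logical thread per output pixel; each thread walks the whole k x k window."""
    h, w, k = len(img), len(img[0]), len(mask)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for tid in range(h * w):                      # thread id -> pixel
        y, x = divmod(tid, w)
        acc = 0.0
        for dy in range(k):
            for dx in range(k):
                yy, xx = y + dy - r, x + dx - r
                if 0 <= yy < h and 0 <= xx < w:   # zero padding at the borders
                    acc += mask[dy][dx] * img[yy][xx]
        out[y][x] = acc
    return out


def fine_grained(img: Grid, mask: Grid) -> Grid:
    """One logical thread per (pixel, window offset) pair. Each thread computes a
    single product; on a GPU the per-pixel accumulation would need a reduction
    or atomics, here it is a plain sequential sum."""
    h, w, k = len(img), len(img[0]), len(mask)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for tid in range(h * w * k * k):              # thread id -> (pixel, offset)
        pix, off = divmod(tid, k * k)
        y, x = divmod(pix, w)
        dy, dx = divmod(off, k)
        yy, xx = y + dy - r, x + dx - r
        if 0 <= yy < h and 0 <= xx < w:
            out[y][x] += mask[dy][dx] * img[yy][xx]
    return out


if __name__ == "__main__":
    image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
    box = [[1.0] * 3 for _ in range(3)]
    assert pixelwise(image, box) == fine_grained(image, box)
    print(pixelwise(image, box)[1][1])  # 45.0
```

Both variants compute the same result; the fine-grained one simply exposes k*k times as many independent work items, which is presumably the extra parallelism a GPU can exploit.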

Computer Science Technical Reports @Virginia Tech

Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Author: Hunold Sascha
Lehr Markus
Schulz Christian
Träff Jesper Larsson
von Kirchbach Konrad
Publication date: 01/01/2020

Good process-to-compute-node mappings can be decisive for well-performing HPC applications. A special, important class of process-to-node mapping problems is that of mapping processes that communicate in a sparse stencil pattern to Cartesian grids. By thoroughly exploiting the structure inherent in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar results in communication performance. Furthermore, our algorithms also achieve significantly better mapping quality than previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.
Comment: 18 pages, 9 figures
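To see why the mapping matters, one can count the stencil neighbour pairs that end up on different nodes. The sketch below is not one of the paper's algorithms; it assumes a 2-D process grid, a 5-point stencil without wrap-around, and nodes hosting equal numbers of processes, and compares a row-major rank placement with a blocked one. All function names are hypothetical.

```python
from typing import Callable, Tuple

Coord = Tuple[int, int]  # (x, y) position in the process grid


def inter_node_edges(px: int, py: int, node_of: Callable[[Coord], int]) -> int:
    """Count undirected 5-point-stencil neighbour pairs placed on different nodes."""
    cut = 0
    for x in range(px):
        for y in range(py):
            # Look only right and down so each undirected edge is counted once.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < px and ny < py and node_of((x, y)) != node_of((nx, ny)):
                    cut += 1
    return cut


def row_major(px: int, per_node: int) -> Callable[[Coord], int]:
    """Consecutive ranks in row-major order fill one node after another."""
    return lambda c: (c[1] * px + c[0]) // per_node


def blocked(bx: int, by: int, px: int) -> Callable[[Coord], int]:
    """Each node receives a bx x by sub-block of the process grid."""
    blocks_per_row = px // bx
    return lambda c: (c[1] // by) * blocks_per_row + c[0] // bx
```

For an 8 x 8 grid on four 16-process nodes, `inter_node_edges(8, 8, row_major(8, 16))` gives 24 cut pairs, while `inter_node_edges(8, 8, blocked(4, 4, 8))` gives 16, because square blocks have a smaller perimeter than strips of two rows.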

arXiv.org e-Print Archive

Crossref

reposiTUm (TUW Vienna)

DReAM: Dynamic Re-arrangement of Address Mapping to Improve the Performance of DRAMs

Author: Garside Jim
Ghasempour Mohsen
Jaleel Aamer
Luján Mikel
Publication date: 12/09/2015

The initial location of data in DRAMs is determined and controlled by the 'address mapping', and even modern memory controllers use a fixed, run-time-agnostic address mapping. On the other hand, the memory access pattern seen at the memory interface changes dynamically at run time. Because a fixed address mapping cannot follow this changing access pattern, DRAM performance cannot be exploited efficiently. DReAM is a novel hardware technique that detects a workload-specific address mapping at run time, based on the application's access pattern, and thereby improves DRAM performance. The experimental results show that DReAM outperforms the best evaluated address mapping on average by 9% for mapping-sensitive workloads and by 2% for mapping-insensitive workloads, and by up to 28% across all workloads. DReAM can be seen as an insurance policy capable of detecting which scenarios are not well served by the predefined address mapping.
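An address mapping is a function from a physical address to DRAM coordinates (row, bank, column). The sketch below assumes a hypothetical geometry (8 banks, 1024 columns per row, byte-offset bits ignored) and contrasts plain bit slicing with XOR-based (permutation-based) bank interleaving, a well-known static scheme. Neither function is DReAM's run-time mechanism; they only show how a fixed mapping interacts with an access pattern.

```python
COL_BITS, BANK_BITS = 10, 3  # assumed: 1024 columns per row, 8 banks


def bit_field(addr: int) -> tuple:
    """Plain bit slicing of the address as | row | bank | column |."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = addr >> (COL_BITS + BANK_BITS)
    return row, bank, col


def xor_bank(addr: int) -> tuple:
    """Same fields, but the bank index is XOR-ed with the low row bits.
    For a fixed row this permutes the banks, so the mapping stays one-to-one."""
    row, bank, col = bit_field(addr)
    return row, bank ^ (row & ((1 << BANK_BITS) - 1)), col


def banks_touched(mapping, addrs) -> int:
    """Number of distinct banks an access stream is spread over."""
    return len({mapping(a)[1] for a in addrs})
```

A stream with a stride of one full row (`[i << 13 for i in range(8)]`) lands in a single bank under `bit_field` but is spread over all 8 banks under `xor_bank`, while a stride of one bank (`[i << 10 for i in range(8)]`) is spread over 8 banks by both. Which fixed mapping wins depends on the access pattern, which is the motivation for selecting a mapping at run time.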

arXiv.org e-Print Archive

Crossref

The University of Manchester - Institutional Repository

Maximum Likelihood Decoder for Index Coded PSK Modulation for Priority Ordered Receivers

Author: Rajan B. Sundar
S. Divya U.
Publication date: 09/03/2017

Index coded PSK modulation over an AWGN broadcast channel for a given index coding problem (ICP) is studied. For a chosen index code and an arbitrary mapping (of broadcast vectors to PSK signal points), we have derived a decision rule for the maximum likelihood (ML) decoder. The message error performance of a receiver at high SNR is characterized by a parameter called the PSK Index Coding Gain (PSK-ICG). The PSK-ICG of a receiver is determined by a metric called the minimum inter-set distance. For a given ICP with an order of priority among the receivers, and a chosen $2^N$-PSK constellation, we propose an algorithm to find (index code, mapping) pairs, each of which gives the best performance in terms of the PSK-ICG of the receivers. No other pair of index code (of length $N$ with $2^N$ broadcast vectors) and mapping can give a better PSK-ICG for the highest priority receiver. Also, given that the highest priority receiver achieves its best performance, the next highest priority receiver achieves its maximum possible gain, and so on in the specified order of priority.
Comment: 9 pages, 6 figures and 2 tables
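A toy illustration of why the mapping matters: under AWGN with equally likely hypotheses, ML decoding reduces to picking the nearest signal point, and a receiver's side information shrinks the set of candidate points. The setup is entirely hypothetical ($N = 2$, 4-PSK on the unit circle, a receiver that already knows the first bit of the broadcast vector, two made-up mappings `MAP_A` and `MAP_B`); it is not the paper's decision rule, index code construction, or PSK-ICG metric.

```python
import cmath
import itertools
import math
from typing import Dict, Tuple

Vec = Tuple[int, int]  # a 2-bit broadcast vector


def psk_point(k: int, m: int) -> complex:
    """k-th point of an m-PSK constellation on the unit circle."""
    return cmath.exp(2j * math.pi * k / m)


def candidates(mapping: Dict[Vec, int], known_first_bit: int) -> Dict[Vec, complex]:
    """Signal points whose broadcast vector agrees with the receiver's side information."""
    m = len(mapping)
    return {v: psk_point(k, m) for v, k in mapping.items() if v[0] == known_first_bit}


def ml_decode(received: complex, cand: Dict[Vec, complex]) -> Vec:
    """Nearest-point decision, i.e. ML under AWGN for equally likely candidates."""
    return min(cand, key=lambda v: abs(received - cand[v]))


def min_distance(cand: Dict[Vec, complex]) -> float:
    """Smallest Euclidean distance between two remaining candidate points."""
    return min(abs(p - q) for p, q in itertools.combinations(cand.values(), 2))


# Two hypothetical mappings of 2-bit broadcast vectors to 4-PSK point indices.
MAP_A = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
MAP_B = {(0, 0): 0, (1, 0): 1, (0, 1): 2, (1, 1): 3}
```

With the first bit known to be 1, `MAP_A` leaves two adjacent points (distance sqrt(2)) while `MAP_B` leaves two antipodal points (distance 2), so the same receiver is better protected under `MAP_B`. This only conveys the intuition; the paper's minimum inter-set distance and priority-ordered search are more involved.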

arXiv.org e-Print Archive

Crossref

ePrints@IISc

Efficiently Mapping Linear Algebra to High-Performance Code

Author: Barthels Henrik
Bientinesi Paolo
Psarras Christos
Publication date: 01/01/2019

Aware of the role that linear algebra plays in scientific applications, we investigate if and how matrix expressions can be efficiently evaluated with current high-level languages. On the one hand, the numerical linear algebra community has put a lot of effort into developing and optimizing a relatively small set of “universally” useful operations. These are packaged in libraries such as BLAS and LAPACK, and serve as building blocks for more complex computations. On the other hand, the linear algebra expressions that arise in many domains are significantly more complex than those building blocks. We refer to the problem of expressing a linear algebra expression in terms of a set of available building blocks as the “Linear Algebra Mapping Problem” (LAMP). In practice, users have two alternatives to solve a given LAMP: (1) “manually”, by using C/C++ or FORTRAN in combination with explicit calls to BLAS and LAPACK, or (2) “automatically”, by using one of the high-level languages (or libraries) with an API that directly captures the expressions. In this presentation, we focus only on the latter. Specifically, we consider six languages (or libraries): Matlab, Julia, R, NumPy (Python), Eigen (C++), and Armadillo (C++), and carefully assess how effectively they translate linear algebra expressions to code, i.e., how well they solve LAMPs. We investigate a number of aspects that are critical for the efficient solution of a LAMP. These range from the most basic mapping problem (“Given the expression A*B, does the language map it to a call to GEMM?”), to optimal parenthesization, to the exploitation of matrix properties, to the identification and elimination (if advantageous) of common sub-expressions, and more. Ultimately, the purpose of this study is to exhibit the core challenges related to the effective computation of linear algebra expressions, and to help the development of languages and libraries.
Texas Advanced Computing Center (TACC)
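The "optimal parenthesization" aspect of a LAMP is the classic matrix-chain ordering problem, which can be solved by dynamic programming. The sketch below assumes the textbook cost model (multiplying a p x q by a q x r matrix costs p*q*r multiply-adds) and ignores matrix properties, GEMM efficiency and common sub-expressions; `matrix_chain` is a hypothetical helper, not how any of the six surveyed languages is implemented.

```python
from functools import lru_cache
from typing import List, Tuple


def matrix_chain(dims: List[int]) -> Tuple[int, str]:
    """Matrix A{i+1} has shape dims[i] x dims[i+1].
    Returns (minimal multiply-add count, parenthesization string)."""
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> Tuple[int, str]:
        # Cheapest way to form A{i+1} ... A{j+1} (0-based indices i..j inclusive).
        if i == j:
            return 0, f"A{i + 1}"
        options = []
        for k in range(i, j):  # split between A{k+1} and A{k+2}
            left_cost, left = best(i, k)
            right_cost, right = best(k + 1, j)
            # Left result is dims[i] x dims[k+1], right is dims[k+1] x dims[j+1].
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            options.append((cost, f"({left}{right})"))
        return min(options)

    return best(0, n - 1)
```

For shapes 10x30, 30x5 and 5x60, `matrix_chain([10, 30, 5, 60])` returns `(4500, "((A1A2)A3)")`, whereas `(A1(A2A3))` would cost 27000. For a matrix-matrix-vector product, `matrix_chain([100, 100, 100, 1])` picks the right-to-left order `(A1(A2A3))` at 20000 multiply-adds instead of 1,010,000.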

Texas ScholarWorks

Task mapping on a dragonfly supercomputer

Author: Coskun Ayse K.
Leung Vitus
Tuncer Ozan
Zhang Yijia
Publication date: 14/09/2017

The dragonfly network topology has recently gained traction in the design of high performance computing (HPC) systems and has been implemented in large-scale supercomputers. The impact of task mapping, i.e., the placement of MPI ranks onto compute cores, on the communication performance of applications on dragonfly networks has not been comprehensively investigated on real large-scale systems. This paper demonstrates that task mapping significantly affects communication overhead in dragonflies and that the magnitude of this effect is sensitive to the application, the job size, and the OpenMP settings. Among the three task mapping algorithms we study (in-order, random, and recursive coordinate bisection), selecting a suitable task mapper reduces application communication time by up to 47%.
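Recursive coordinate bisection, one of the three mappers named above, repeatedly splits a set of task coordinates at the median of its widest dimension. The sketch below is a minimal version under stated assumptions (tasks carry grid coordinates, the number of parts is a power of two, ties go to the first dimension); how the study assigns the resulting parts to dragonfly nodes is not modeled, and `rcb` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def rcb(points: Sequence[Point], levels: int) -> List[List[Point]]:
    """Split points into 2**levels parts (fewer if the points run out)."""
    pts = list(points)
    if levels == 0 or len(pts) <= 1:
        return [pts]
    dims = len(pts[0])
    spans = [max(p[d] for p in pts) - min(p[d] for p in pts) for d in range(dims)]
    axis = spans.index(max(spans))          # widest dimension, first one on ties
    pts.sort(key=lambda p: (p[axis], p))    # deterministic order along that axis
    mid = len(pts) // 2                     # median split keeps the halves balanced
    return rcb(pts[:mid], levels - 1) + rcb(pts[mid:], levels - 1)
```

On a 4 x 4 task grid, `rcb([(x, y) for x in range(4) for y in range(4)], 2)` yields four 2 x 2 blocks. A mapper could then place each part on one node, for example with `{p: i for i, part in enumerate(parts) for p in part}`, so that neighbouring tasks stay together.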

Boston University Institutional Repository (OpenBU)

Unified bit-based probabilistic data association aided MIMO detection for high-order QAM constellations

Author: Hanzo Lajos
Lv Tiejun
Maunder Robert G.
Yang Shaoshi
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/03/2011

A unified Bit-based Probabilistic Data Association (B-PDA) detection approach is proposed for Multiple-Input Multiple-Output (MIMO) systems employing high-order rectangular Quadrature Amplitude Modulation (QAM). The new approach transforms the symbol detection process of QAM into a bit-based process by introducing a Unified Matrix Representation (UMR) of QAM. Both linear natural and nonlinear binary reflected Gray bit-to-symbol mappings are considered. With the aid of simulation results, we demonstrate that the linear natural mapping based B-PDA approach typically attains an improved detection performance (measured in terms of both Bit Error Ratio (BER) and Symbol Error Ratio (SER)) in comparison to the conventional symbol-based PDA aided MIMO detector, despite its dramatically reduced computational complexity. The only exception is that at low SNRs, the linear natural mapping based B-PDA is slightly inferior in terms of its BER to the conventional symbol-based PDA using binary reflected Gray mapping. Furthermore, the simulation results show that the linear natural mapping based B-PDA MIMO detector may approach the best-case performance provided by the nonlinear binary reflected Gray mapping based B-PDA MIMO detector under ideal conditions. Additionally, the implementation of the B-PDA MIMO detector is shown to be much simpler in the case of the linear natural mapping. Based on these two points, we conclude that in the context of the uncoded B-PDA MIMO detector it is preferable to use the linear natural bit-to-symbol mapping rather than the nonlinear Gray mapping.
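Why the natural mapping is called "linear" can be checked on one real dimension of 16-QAM (4-PAM). With each bit mapped to an antipodal value s = 2b - 1 in {-1, +1}, the natural mapping equals the weighted sum 2*s0 + s1, while the Gray mapping cannot be written that way. This is an assumption-level 1-D illustration, not the paper's Unified Matrix Representation; the table names and helpers are hypothetical.

```python
from itertools import product

# Amplitudes of one 4-PAM dimension under the two bit-to-symbol mappings.
NATURAL = {(0, 0): -3, (0, 1): -1, (1, 0): 1, (1, 1): 3}  # ordered by binary value
GRAY = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}     # binary reflected Gray order


def linear_form(bits) -> int:
    """Weighted sum of antipodal bits: 2*s0 + s1 with s = 2b - 1."""
    s0, s1 = (2 * b - 1 for b in bits)
    return 2 * s0 + s1


def is_linear(mapping) -> bool:
    """True if every amplitude equals the weighted antipodal-bit sum."""
    return all(mapping[b] == linear_form(b) for b in product((0, 1), repeat=2))


def gray_neighbours_ok(mapping) -> bool:
    """True if the labels of adjacent amplitudes differ in exactly one bit."""
    order = sorted(mapping, key=mapping.get)
    return all(sum(a != b for a, b in zip(u, v)) == 1 for u, v in zip(order, order[1:]))
```

`is_linear(NATURAL)` holds while `is_linear(GRAY)` does not; conversely, only `GRAY` has the single-bit-difference property between neighbouring amplitudes. The linear form is what lets a detector treat each bit as a separate antipodal variable with a fixed weight.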

Southampton (e-Prints Soton)

Crossref