Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well-known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform.
Efficient Process-to-Node Mapping Algorithms for Stencil Computations
Good process-to-compute-node mappings can be decisive for well-performing HPC
applications. A special, important class of process-to-node mapping problems is
the problem of mapping processes that communicate in a sparse stencil pattern
to Cartesian grids. By thoroughly exploiting the inherently present structure
in this type of problem, we devise three novel distributed algorithms that are
able to handle arbitrary stencil communication patterns effectively. We analyze
the expected performance of our algorithms based on an abstract model of inter-
and intra-node communication. An extensive experimental evaluation on several
HPC machines shows that our algorithms are up to two orders of magnitude faster
in running time than a (sequential) high-quality general graph mapping tool,
while obtaining similar results in communication performance. Furthermore, our
algorithms also achieve significantly better mapping quality compared to
previous state-of-the-art Cartesian grid mapping algorithms. This results in up
to a threefold performance improvement of an MPI_Neighbor_alltoall exchange
operation. Our new algorithms can be used to implement the MPI_Cart_create
functionality.
Comment: 18 pages, 9 figures
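To make the mapping problem in this abstract concrete, the following minimal sketch counts how many messages of a 5-point stencil cross node boundaries under two process-to-node mappings of a 2D process grid. The grid size, the node size, and all function names are illustrative assumptions; this is not one of the paper's three algorithms.

```python
def internode_messages(rows, cols, node_of):
    """Count directed stencil messages (4 neighbors, no wraparound)
    whose sender and receiver are placed on different nodes."""
    count = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    if node_of(r, c) != node_of(nr, nc):
                        count += 1
    return count


def row_major_mapping(cols, procs_per_node):
    # Hypothetical default: consecutive ranks (row-major order) fill a node.
    return lambda r, c: (r * cols + c) // procs_per_node


def blocked_mapping(cols, block_rows, block_cols):
    # Hypothetical alternative: each node owns a block_rows x block_cols tile.
    blocks_per_row = cols // block_cols
    return lambda r, c: (r // block_rows) * blocks_per_row + (c // block_cols)


ROWS, COLS = 8, 8  # 64 processes, 16 per node in both mappings
linear = internode_messages(ROWS, COLS, row_major_mapping(COLS, 16))
blocked = internode_messages(ROWS, COLS, blocked_mapping(COLS, 4, 4))
print(linear, blocked)
```

In this toy setting the tiled mapping sends fewer messages between nodes than the row-major one, which is the kind of structure a Cartesian-grid mapper can exploit.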
DReAM: Dynamic Re-arrangement of Address Mapping to Improve the Performance of DRAMs
The initial location of data in DRAMs is determined and controlled by the
'address-mapping' and even modern memory controllers use a fixed and
run-time-agnostic address mapping. On the other hand, the memory access pattern
seen at the memory interface level will dynamically change at run-time. This
dynamic nature of memory access pattern and the fixed behavior of address
mapping process in DRAM controllers, implied by using a fixed address mapping
scheme, means that DRAM performance cannot be exploited efficiently. DReAM is a
novel hardware technique that can detect a workload-specific address mapping at
run-time based on the application access pattern which improves the performance
of DRAMs. The experimental results show that DReAM outperforms the best
evaluated address mapping on average by 9% for mapping-sensitive workloads and
by 2% for mapping-insensitive workloads, and by up to 28% across all workloads.
DReAM can be seen as an insurance policy capable of detecting which scenarios
are not well served by the predefined address mapping.
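The tension between a fixed address mapping and a changing access pattern can be sketched as follows. The bit widths, the two mappings, and the "banks touched" metric are simplifying assumptions chosen for illustration; they do not reproduce DReAM's hardware mechanism or its evaluated mappings.

```python
BANK_BITS = 3  # assumed: 8 banks
COL_BITS = 7   # assumed: 128 columns per row


def map_row_bank_col(addr):
    """Assumed layout, high to low bits: row | bank | column."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = addr >> (COL_BITS + BANK_BITS)
    return row, bank, col


def map_row_col_bank(addr):
    """Assumed layout, high to low bits: row | column | bank."""
    bank = addr & ((1 << BANK_BITS) - 1)
    col = (addr >> BANK_BITS) & ((1 << COL_BITS) - 1)
    row = addr >> (BANK_BITS + COL_BITS)
    return row, bank, col


def banks_touched(addresses, mapping):
    # Crude proxy for bank-level parallelism: number of distinct banks used.
    return len({mapping(a)[1] for a in addresses})


sequential = range(64)          # unit-stride accesses
strided = range(0, 64 * 8, 8)   # stride-8 accesses

for name, pattern in (("sequential", sequential), ("strided", strided)):
    print(name,
          banks_touched(pattern, map_row_bank_col),
          banks_touched(pattern, map_row_col_bank))
```

Under these assumptions each mapping spreads one pattern across more banks than the other does, so neither fixed choice is best for both patterns.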
Maximum Likelihood Decoder for Index Coded PSK Modulation for Priority Ordered Receivers
Index coded PSK modulation over an AWGN broadcast channel, for a given index
coding problem (ICP), is studied. For a chosen index code and an arbitrary
mapping (of broadcast vectors to PSK signal points), we have derived a decision
rule for the maximum likelihood (ML) decoder. The message error performance of
a receiver at high SNR is characterized by a parameter called PSK Index Coding
Gain (PSK-ICG). The PSK-ICG of a receiver is determined by a metric called
minimum inter-set distance. For a given ICP with an order of priority among the
receivers, and a chosen -PSK constellation we propose an algorithm to find
(index code, mapping) pairs, each of which gives the best performance in terms
of PSK-ICG of the receivers. No other pair of index code (of length with
broadcast vectors) and mapping can give a better PSK-ICG for the highest
priority receiver. Also, given that the highest priority receiver achieves its
best performance, the next highest priority receiver achieves its maximum gain
possible and so on in the specified order of priority.
Comment: 9 pages, 6 figures and 2 tables
Efficiently Mapping Linear Algebra to High-Performance Code
Aware of the role that linear algebra plays in scientific applications, we investigate if/how matrix expressions can be efficiently evaluated with current high-level languages. On the one hand, the numerical linear algebra community has put a lot of effort into developing and optimizing a relatively small set of “universally” useful operations. These are packaged in libraries such as BLAS and LAPACK, and serve as building blocks for more complex computations. On the other hand, the linear algebra expressions that arise in many domains are significantly more complex than those building blocks. We refer to the problem of expressing a linear algebra expression in terms of a set of available building blocks as the “Linear Algebra Mapping Problem” (LAMP). In practice, users have two alternatives to solve a given LAMP: 1) “manually”, by using C/C++ or FORTRAN in combination with explicit calls to BLAS and LAPACK, or 2) “automatically”, by using one of the high-level languages (or libraries) with an API that directly captures the expressions. In this presentation, we focus only on the latter. Specifically, we consider 6 languages (or libraries): Matlab, Julia, R, NumPy (Python), Eigen (C++), and Armadillo (C++), and carefully assess how effectively they translate linear algebra expressions to code, i.e., how well they solve LAMPs. We investigate a number of aspects that are critical for the efficient solution of a LAMP. These range from the most basic mapping problem “Given the expression A*B, does the language map it to a call to GEMM?”, to the optimal parenthesization, to the exploitation of properties, to the identification and elimination (if advantageous) of common sub-expressions, and more. Ultimately, the purpose of this study is to exhibit the core challenges related to the effective computation of linear algebra expressions, and to help the development of languages and libraries.
Texas Advanced Computing Center (TACC)
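One of the aspects listed in this abstract, optimal parenthesization, can be illustrated with the classic matrix-chain dynamic program. The sketch below compares the scalar-multiplication count of a plain left-to-right evaluation with the minimum over all parenthesizations. The function names and the example shapes are assumptions made for illustration, and the cost model counts multiplications only; it says nothing about how any of the six languages actually evaluates a chain.

```python
from functools import lru_cache


def optimal_chain_cost(dims):
    """Minimum scalar multiplications for A0 * A1 * ... * A(n-1),
    where matrix Ai has shape dims[i] x dims[i + 1]."""
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to multiply matrices i..j (inclusive).
        if i == j:
            return 0
        return min(
            best(i, k) + best(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return best(0, n - 1)


def left_to_right_cost(dims):
    """Cost of evaluating ((A0 * A1) * A2) * ... strictly left to right."""
    cost = 0
    for k in range(1, len(dims) - 1):
        # The running product has shape dims[0] x dims[k].
        cost += dims[0] * dims[k] * dims[k + 1]
    return cost


# Hypothetical expression A * B * v: two 1000 x 1000 matrices and a vector.
dims = [1000, 1000, 1000, 1]
print(left_to_right_cost(dims))   # (A * B) * v
print(optimal_chain_cost(dims))   # A * (B * v)
```

For this example the left-to-right order costs roughly 500 times more multiplications than evaluating `A * (B * v)`, which is why the choice of parenthesization matters when an expression is mapped to GEMM and GEMV calls.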
Task mapping on a dragonfly supercomputer
The dragonfly network topology has recently gained traction in the design of high performance computing (HPC) systems and has been implemented in large-scale supercomputers. The impact of task mapping, i.e., placement of MPI ranks onto compute cores, on the communication performance of applications on dragonfly networks has not been comprehensively investigated on real large-scale systems. This paper demonstrates that task mapping affects the communication overhead significantly in dragonflies and the magnitude of this effect is sensitive to the application, job size, and the OpenMP settings. Among the three task mapping algorithms we study (in-order, random, and recursive coordinate bisection), selecting a suitable task mapper reduces application communication time by up to 47%.
Unified bit-based probabilistic data association aided MIMO detection for high-order QAM constellations
A unified Bit-based Probabilistic Data Association (B-PDA) detection approach is proposed for Multiple-Input Multiple-Output (MIMO) systems employing high-order rectangular Quadrature Amplitude Modulation (QAM). The new approach transforms the symbol detection process of QAM to a bit-based process by introducing a Unified Matrix Representation (UMR) of QAM. Both linear natural and nonlinear binary reflected Gray bit-to-symbol mappings are considered. With the aid of simulation results, we demonstrate that the linear natural mapping based B-PDA approach typically attained an improved detection performance (measured in terms of both Bit Error Ratio (BER) and Symbol Error Ratio (SER)) in comparison to the conventional symbol-based PDA aided MIMO detector, despite its dramatically reduced computational complexity. The only exception is that at low SNRs, the linear natural mapping based B-PDA is slightly inferior in terms of its BER to the conventional symbol-based PDA using binary reflected Gray mapping. Furthermore, the simulation results show that the linear natural mapping based B-PDA MIMO detector may approach the best-case performance provided by the nonlinear binary reflected Gray mapping based B-PDA MIMO detector under ideal conditions. Additionally, the implementation of the B-PDA MIMO detector is shown to be much simpler in the case of the linear natural mapping. Based on these two points, we conclude that in the context of the uncoded B-PDA MIMO detector it is preferable to use the linear natural bit-to-symbol mapping, rather than the nonlinear Gray mapping.
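The two bit-to-symbol mappings contrasted in this abstract can be sketched for one axis of a rectangular QAM constellation. The code below builds natural and binary reflected Gray labels for 8 amplitude levels (the per-axis levels of 64-QAM) and shows that the natural mapping is a linear function of antipodal bits. The helper names are assumptions, and this is only an illustration of the mappings, not the paper's Unified Matrix Representation or its detector.

```python
def gray(n):
    """Binary reflected Gray code of n."""
    return n ^ (n >> 1)


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def adjacent_bit_flips(labels):
    """Bit differences between the labels of neighboring amplitude levels."""
    return [hamming(a, b) for a, b in zip(labels, labels[1:])]


def pam_level_from_bits(label, bits):
    """Natural mapping: amplitude as a weighted sum of antipodal bits,
    i.e. sum over k of 2**k * (2 * b_k - 1)."""
    return sum((1 << k) * (2 * ((label >> k) & 1) - 1) for k in range(bits))


BITS = 3  # 8 levels per axis
levels = range(1 << BITS)
natural_labels = list(levels)
gray_labels = [gray(n) for n in levels]

print(adjacent_bit_flips(natural_labels))
print(adjacent_bit_flips(gray_labels))
print([pam_level_from_bits(n, BITS) for n in levels])
```

Gray labels of neighboring levels always differ in one bit, whereas natural labels can differ in several; in exchange, only the natural mapping gives an amplitude that is linear in the bits.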
