Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and Engineering
As the performance and energy-efficiency requirements of GPGPUs have risen, GPGPU memory management techniques have improved to meet them by employing hardware caches and heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher memory bandwidth. However, they do not always guarantee improved performance and energy efficiency, owing to small cache sizes and the heterogeneity of the memory nodes. While prior works have proposed various techniques to address these issues, relatively little work has investigated holistic support for memory management.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the conventional baseline indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and demonstrate that ACI continues to achieve high performance in various settings.
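The core idea behind XOR-based advanced cache indexing can be sketched in a few lines. The scheme below is a generic bitwise-XOR index, not the dissertation's exact design; the set count and line size are illustrative assumptions.

```python
# Hypothetical sketch of a static XOR-based advanced cache indexing scheme.
# The parameters (64 sets, 128-byte lines) are illustrative assumptions.
NUM_SETS = 64                         # 64 cache sets -> 6 index bits
LINE_BITS = 7                         # 128-byte cache lines
INDEX_BITS = NUM_SETS.bit_length() - 1

def conventional_index(addr):
    """Baseline modulo indexing: the set bits just above the line offset."""
    return (addr >> LINE_BITS) & (NUM_SETS - 1)

def xor_index(addr):
    """XOR the conventional index bits with higher tag bits, spreading
    power-of-two strides across sets."""
    lo = (addr >> LINE_BITS) & (NUM_SETS - 1)
    hi = (addr >> (LINE_BITS + INDEX_BITS)) & (NUM_SETS - 1)
    return lo ^ hi

# A power-of-two stride maps every access to set 0 under conventional
# indexing, while XOR indexing spreads the same accesses over distinct sets.
stride = NUM_SETS << LINE_BITS        # one full cache's worth of bytes
addrs = [i * stride for i in range(8)]
conv_sets = {conventional_index(a) for a in addrs}
xor_sets = {xor_index(a) for a in addrs}
```

Power-of-two strides are common in GPGPU workloads, which is why conflict-miss pathologies arise under conventional indexing in the first place.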
Second, we propose IACM, an integrated adaptive cache management scheme for high-performance and energy-efficient GPGPU computing. Based on a performance-pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (by 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (up to 361.4%, and 7.7% on average). Furthermore, IACM delivers significant performance and energy-efficiency gains over the baseline GPGPU architecture even when it is enhanced with advanced architectural technologies (e.g., higher capacity and associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on this ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (by 13.4% and 16.7%, respectively) and performs similarly to the static-best version (1.2% difference), which requires extensive offline profiling.
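As a rough illustration of ratio-driven placement, the sketch below interleaves pages between GPU and CPU memory so that the fraction on each node matches a target allocation ratio. Deriving that ratio directly from raw node bandwidths is an assumption made here for simplicity (BLPP determines it from application characteristics), and the function name is hypothetical.

```python
# Hypothetical sketch of ratio-driven page placement across two memory nodes.
# The bandwidth-proportional ratio is an illustrative assumption, not BLPP's
# actual model.
def place_pages(num_pages, gpu_bw_gbs, cpu_bw_gbs):
    """Return a per-page placement list ('gpu' or 'cpu') that interleaves
    pages so each node's share matches the target allocation ratio."""
    ratio = gpu_bw_gbs / (gpu_bw_gbs + cpu_bw_gbs)
    placement, gpu_credit = [], 0.0
    for _ in range(num_pages):
        gpu_credit += ratio           # accumulate the GPU's fractional share
        if gpu_credit >= 1.0:         # a whole page is owed to the GPU
            placement.append("gpu")
            gpu_credit -= 1.0
        else:
            placement.append("cpu")
    return placement

# 3:1 bandwidth ratio -> roughly 3 of every 4 pages land in GPU memory
plan = place_pages(10, gpu_bw_gbs=600, cpu_bw_gbs=200)
```

Interleaving (rather than placing one contiguous chunk per node) keeps both memory nodes' bandwidth in use throughout the access stream.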
Get Out of the Valley: Power-Efficient Address Mapping for GPUs
GPU memory systems adopt a multi-dimensional hardware structure to provide the bandwidth necessary to support 100s to 1000s of concurrent threads. On the software side, GPU-compute workloads also use multi-dimensional structures to organize their threads. We observe that these structures can combine unfavorably, creating significant resource imbalance in the memory subsystem and causing low performance and poor power-efficiency. The key issue is that which memory address bits exhibit high variability is highly application-dependent.
To solve this problem, we first provide an entropy analysis approach tailored for the highly concurrent memory request behavior in GPU-compute workloads. Our window-based entropy metric captures the information content of each address bit of the memory requests that are likely to co-exist in the memory system at runtime. Using this metric, we find that GPU-compute workloads exhibit entropy valleys distributed throughout the lower order address bits. This indicates that efficient GPU-address mapping schemes need to harvest entropy from broad address-bit ranges and concentrate the entropy into the bits used for channel and bank selection in the memory subsystem. This insight leads us to propose the Page Address Entropy (PAE) mapping scheme which concentrates the entropy of the row, channel and bank bits of the input address into the bank and channel bits of the output address. PAE maps straightforwardly to hardware and can be implemented with a tree of XOR-gates. PAE improves performance by 1.31x and power-efficiency by 1.25x compared to state-of-the-art permutation-based address mapping.
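The window-based metric can be illustrated with a small sketch: for each address bit, measure the Shannon entropy of that bit's values over consecutive windows of requests that would plausibly co-exist in the memory system. The window size and the stride-4 trace below are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch of a window-based per-bit entropy metric.
import math

def bit_entropy(addresses, bit):
    """Shannon entropy of one address bit over a set of addresses."""
    ones = sum((a >> bit) & 1 for a in addresses)
    p = ones / len(addresses)
    if p in (0.0, 1.0):
        return 0.0                    # a constant bit carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def windowed_entropy(trace, bit, window=4):
    """Average entropy of `bit` over consecutive request windows."""
    windows = [trace[i:i + window]
               for i in range(0, len(trace) - window + 1, window)]
    return sum(bit_entropy(w, bit) for w in windows) / len(windows)

# A stride-4 access trace: bits 0-1 are constant (an entropy valley),
# while bit 2 toggles every request and carries a full bit of entropy.
trace = [i * 4 for i in range(16)]
low = windowed_entropy(trace, 0)      # entropy valley
high = windowed_entropy(trace, 2)     # high-variability bit
```

Bits with near-zero windowed entropy are exactly the ones a mapping scheme like PAE must avoid using directly for channel/bank selection.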
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a method for parallel block-sparse matrix-matrix multiplication on
distributed memory clusters. By using a quadtree matrix representation, data
locality is exploited without prior information about the matrix sparsity
pattern. A distributed quadtree matrix representation is straightforward to
implement due to our recent development of the Chunks and Tasks programming
model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined
with the Chunks and Tasks model leads to favorable weak and strong scaling of
the communication cost with the number of processes, as shown both
theoretically and in numerical experiments.
Matrices are represented by sparse quadtrees of chunk objects. The leaves in
the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by
the matrix library and may occur at any level in the hierarchy and/or within
the submatrix leaves. In case graphics processing units (GPUs) are available,
both CPUs and GPUs are used for leaf-level multiplication work, thus making use
of the full computing capacity of each node.
The performance is evaluated for matrices with different sparsity structures,
including examples from electronic structure calculations. Compared to methods
that do not exploit data locality, our locality-aware approach reduces
communication significantly, achieving essentially constant communication per
node in weak scaling tests.
Comment: 35 pages, 14 figures
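A minimal sequential sketch of quadtree block-sparse multiplication (not the Chunks and Tasks implementation) clarifies how sparsity at any level skips work: a node is None for an all-zero subtree, a dense leaf block, or a dict of four quadrants, and zero subtrees short-circuit both computation and, in the distributed setting, communication.

```python
# Sketch of quadtree block-sparse matrix-matrix multiplication.
# Representation (an assumption for this sketch): None = zero subtree,
# list-of-lists = dense leaf block, dict = four quadrants nw/ne/sw/se.
def leaf_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def qt_add(a, b):
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, list):                       # dense leaf blocks
        return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return {q: qt_add(a[q], b[q]) for q in a}     # recurse into quadrants

def qt_mul(a, b):
    if a is None or b is None:
        return None                               # sparsity skips the subtree
    if isinstance(a, list):
        return leaf_mul(a, b)
    return {
        "nw": qt_add(qt_mul(a["nw"], b["nw"]), qt_mul(a["ne"], b["sw"])),
        "ne": qt_add(qt_mul(a["nw"], b["ne"]), qt_mul(a["ne"], b["se"])),
        "sw": qt_add(qt_mul(a["sw"], b["nw"]), qt_mul(a["se"], b["sw"])),
        "se": qt_add(qt_mul(a["sw"], b["ne"]), qt_mul(a["se"], b["se"])),
    }

# Block-diagonal example: the off-diagonal zero quadrants are never touched.
A = {"nw": [[1, 2], [3, 4]], "ne": None, "sw": None, "se": [[5, 6], [7, 8]]}
C = qt_mul(A, A)
```

In the distributed version each quadrant would be a chunk, so pruned subtrees translate directly into communication that never happens.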
PointAtrousGraph: Deep Hierarchical Encoder-Decoder with Point Atrous Convolution for Unorganized 3D Points
Motivated by the success of encoding multi-scale contextual information for
image analysis, we propose our PointAtrousGraph (PAG) - a deep
permutation-invariant hierarchical encoder-decoder for efficiently exploiting
multi-scale edge features in point clouds. Our PAG is constructed by several
novel modules, such as Point Atrous Convolution (PAC), Edge-preserved Pooling
(EP) and Edge-preserved Unpooling (EU). Similar to atrous convolution, our
PAC can effectively enlarge receptive fields of filters and thus densely learn
multi-scale point features. Following the idea of non-overlapping max-pooling
operations, we propose our EP to preserve critical edge features during
subsampling. Correspondingly, our EU modules gradually recover spatial
information for edge features. In addition, we introduce chained skip
subsampling/upsampling modules that directly propagate edge features to the
final stage. In particular, our proposed auxiliary loss functions can further
improve performance. Experimental results show that our PAG outperforms
previous state-of-the-art methods on various 3D semantic perception
applications.
Comment: 11 pages, 10 figures
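The dilation idea behind Point Atrous Convolution can be illustrated independently of the network: rather than gathering the k nearest neighbors, sample every d-th nearest neighbor, which enlarges the receptive field without increasing the neighbor count. The 1-D points and function name below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of dilated neighbor sampling (the idea behind PAC).
def atrous_neighbors(points, center, k, d):
    """Return k neighbors of `center`, taking every d-th entry of the
    distance-sorted neighbor list (1-D points for simplicity; the sorted
    list includes the center point itself at distance 0)."""
    others = sorted(points, key=lambda p: abs(p - center))
    return [others[i * d] for i in range(k)]

pts = list(range(10))                          # points at 0..9 on a line
dense = atrous_neighbors(pts, 0, k=3, d=1)     # compact receptive field
dilated = atrous_neighbors(pts, 0, k=3, d=3)   # same k, 3x wider reach
```

With d=1 this degenerates to ordinary kNN gathering; raising d widens the spatial extent covered by the same number of edge features.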
Goodness of the GPU Permutation Index: Performance and Quality Results
Similarity searching is a useful operation for many real applications that work on non-structured or multimedia databases. In these scenarios, it is important to search for objects similar to a given query object. Several indexes exist to avoid exhaustively reviewing all database objects to answer a query. In many cases, even with the help of an index, response times may still be unreasonable, and it becomes necessary to consider approximate similarity searches. In this kind of search, accuracy or determinism is traded for faster queries. A good representative of approximate similarity search is the Permutation Index.
In this paper, we present an implementation of the Permutation Index on a GPU to speed up approximate similarity searches on massive databases. Our implementation takes advantage of GPU parallelism. In addition, we consider answering several queries at the same time.
We also evaluate our parallel index in terms of answer quality and time performance on different GPUs. The search performance is promising independently of the architecture, owing to careful planning and correct use of resources.
Workshop: WBDMD - Base de Datos y Minería de Datos. Red de Universidades con Carreras en Informática
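The Permutation Index idea can be sketched sequentially (the paper's contribution is the GPU parallelization, which is not shown here): each object is represented by the order in which it "sees" a fixed set of permutants, and candidates are ranked by how similar their permutations are to the query's, e.g. under the Spearman footrule. The 1-D data below is illustrative.

```python
# Sequential sketch of the Permutation Index (not the GPU implementation).
def permutation(obj, permutants):
    """Rank vector: position of each permutant when sorted by closeness."""
    order = sorted(range(len(permutants)),
                   key=lambda i: abs(obj - permutants[i]))
    rank = [0] * len(permutants)
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank

def footrule(p, q):
    """Spearman footrule distance between two rank vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

permutants = [0.0, 5.0, 10.0]          # illustrative reference objects
db = [1.0, 4.0, 9.0]                   # illustrative database
query = 8.5
pq = permutation(query, permutants)
# Rank database objects by how similarly they order the permutants;
# objects close to the query tend to see the permutants in the same order.
ranked = sorted(db, key=lambda o: footrule(permutation(o, permutants), pq))
```

The search is approximate because footrule agreement on permutants only correlates with true distance; in exchange, comparing short rank vectors is far cheaper than computing exact distances to the whole database.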
A GPU implementation of the Correlation Technique for Real-time Fourier Domain Pulsar Acceleration Searches
The study of binary pulsars enables tests of general relativity. Orbital
motion in binary systems causes the apparent pulsar spin frequency to drift,
reducing the sensitivity of periodicity searches. Acceleration searches are
methods that account for the effect of orbital acceleration. Existing methods
are computationally expensive, and the vast amount of data that will
be produced by next generation instruments such as the Square Kilometre Array
(SKA) necessitates real-time acceleration searches, which in turn requires the
use of High Performance Computing (HPC) platforms. We present our
implementation of the Correlation Technique for the Fourier Domain Acceleration
Search (FDAS) algorithm on Graphics Processor Units (GPUs). The correlation
technique is applied as a convolution with multiple Finite Impulse Response
filters in the Fourier domain. Two approaches are compared: the first uses the
NVIDIA cuFFT library for applying Fast Fourier Transforms (FFTs) on the GPU,
and the second contains a custom FFT implementation in GPU shared memory. We
find that the FFT shared memory implementation performs between 1.5 and 3.2
times faster than our cuFFT-based application for smaller but sufficient filter
sizes. It is also 4 to 6 times faster than the existing GPU and OpenMP
implementations of FDAS. This work is part of the AstroAccelerate project, a
many-core accelerated time-domain signal processing library for radio
astronomy.
Comment: 20 pages, 9 figures. Accepted for publication in ApJ
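The core Fourier-domain operation, convolving one signal with a bank of FIR filters by pointwise multiplication of spectra, can be sketched in pure Python with a naive DFT (the paper's implementations use cuFFT or a custom shared-memory FFT, which this sketch does not attempt to reproduce). Sizes and filter values below are illustrative.

```python
# Sketch of Fourier-domain FIR filter-bank convolution using a naive DFT.
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def fourier_filter_bank(signal, filters):
    """Circularly convolve `signal` with each FIR filter: one forward
    transform of the signal, one pointwise multiply per filter."""
    n = len(signal)
    sig_f = dft(signal)
    out = []
    for h in filters:
        h_f = dft(h + [0.0] * (n - len(h)))   # zero-pad filter to length n
        out.append(idft([a * b for a, b in zip(sig_f, h_f)]))
    return out

# Verify against direct circular convolution for one short filter.
x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
h = [1.0, -1.0]
y = fourier_filter_bank(x, [h])[0]
direct = [sum(h[m] * x[(t - m) % len(x)] for m in range(len(h)))
          for t in range(len(x))]
```

The key efficiency property this illustrates is that the signal is transformed once and reused across the whole filter bank, which is what makes shared-memory FFT reuse on the GPU pay off.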